Literature DB >> 32707928

LARa: Creating a Dataset for Human Activity Recognition in Logistics Using Semantic Attributes.

Friedrich Niemann1, Christopher Reining1, Fernando Moya Rueda2, Nilah Ravi Nair1, Janine Anika Steffens1, Gernot A Fink2, Michael Ten Hompel1.   

Abstract

Optimizations in logistics require the recognition and analysis of human activities. The potential of sensor-based human activity recognition (HAR) in logistics is not yet well explored. Despite a significant increase in HAR datasets in the past twenty years, no available dataset depicts activities in logistics. This contribution presents the first freely accessible logistics dataset. In the 'Innovationlab Hybrid Services in Logistics' at TU Dortmund University, two picking scenarios and one packing scenario were recreated. Fourteen subjects were recorded individually while performing warehousing activities using Optical marker-based Motion Capture (OMoCap), inertial measurement units (IMUs), and an RGB camera. A total of 758 min of recordings were labeled by 12 annotators in 474 person-h. All data have been labeled and categorized into 8 activity classes and 19 binary coarse-semantic descriptions, also called attributes. The dataset is deployed for solving HAR using deep networks.

Keywords:  attribute-based representation; dataset; human activity recognition; inertial measurement unit; logistics; motion capturing

Year:  2020        PMID: 32707928      PMCID: PMC7436169          DOI: 10.3390/s20154083

Source DB:  PubMed          Journal:  Sensors (Basel)        ISSN: 1424-8220            Impact factor:   3.576


1. Introduction

Human activity recognition (HAR) assigns human action labels to signals of movements. Signals are time series that are obtained from video frames, marker-based motion capturing systems (MoCap), or inertial measurements. This work focuses on HAR using MoCap and inertial measurements. Methods of HAR are critical for many applications, e.g., medical and rehabilitation support, smart homes, sports, and industry [1,2,3]. Nevertheless, HAR is a complicated task due to the large intra- and inter-class variability of human actions [1]. In addition, extensively annotated data for HAR are scarce. This is mainly due to the complexity of the annotation process. Moreover, datasets for HAR are likely to be unbalanced: usually, there exist more samples of frequent activities, e.g., walking or standing, than of rare ones, e.g., picking an article [4,5].

Warehousing is an essential element of every supply chain. Its main purpose is storing articles and satisfying customers' orders in a cost- and time-efficient manner. Despite increasing automation and digitization in warehousing, and contrary to the impression of a shrinking workforce, employee numbers are rising [6,7]. Manual order-picking and -packing are labor-intensive and costly processes in logistics. Information on the occurrence, duration, and properties of relevant human activities is fundamental for assessing how to enhance employee performance in logistics. In the state of the art, manual activities of employees are mostly analyzed manually or analytically, using methods such as the REFA Time Study [8] or Methods-Time Measurement (MTM) [9]. The potential of sensor-based HAR in logistics is not yet well explored. According to Reining et al. [10], only three publications dealt with HAR in logistics in the past ten years [11,12,13]. One major reason for this is the lack of freely accessible and usable datasets that contain industrial work processes.
This is because industrial environments such as factories and warehouses pose a challenge for data recording. Regulations such as the European General Data Protection Regulation [14] create further barriers when handling sensitive data, such as the work performance of employees and their physical characteristics. Thus, scientists tend to fall back on pseudo-industrial laboratory set-ups for dataset creation. The closeness to reality of these small-scale laboratory set-ups is often questionable. For example, the recognition performance in [10] suffered from a recording procedure that split a workflow into activities that were each recorded individually, because the transitions between activities were not captured properly. In the 'Innovationlab Hybrid Services in Logistics' at TU Dortmund University, manual processes of a real-world warehouse are replicated on an area of 220 m² [15]. Fourteen individuals carry out picking and packaging activities in three scenarios under real-life conditions. All activities are recorded in a sensor-rich environment, using an Optical marker-based Motion Capture (OMoCap) system, three sets of inertial measurement units (IMUs), and an RGB camera. All data streams are synchronized. In total, LARa contains 758 min of data. Twelve annotators labeled the OMoCap data in 474 person-h (PHR). A subsequent revision took 143 PHR by 4 revisers. Data are labeled using 8 activity classes and 19 binary coarse-semantic descriptions. These descriptions will be denoted as attributes [16]. Traditional methods of statistical pattern recognition have been used for HAR. These methods segment signal sequences using a sliding-window approach, extract relevant features from the segmented sequences, and train a classifier that assigns action labels. Recently, deep architectures have been used successfully for solving HAR problems. They are end-to-end architectures, composed of feature extractors and a classifier.
They combine learnable convolutional operations with non-linear functions, downsampling operations, and a classification layer [2,3,17,18]. These architectures map sequence segments of measurements from multichannel sensors into a single class or a semantic-based representation [16]. Stacked convolution and downsampling operations extract abstract and complex temporal relations from these input sequences. Attribute-based representations have been deployed for solving HAR. Attributes describe activity classes semantically [16]. For example, handling can be represented by moving the left, right, or both hands, and by the pose resulting from the picked article. Right hand, left hand, and box can be considered attributes. Attributes are used for sharing high-level concepts among activity classes. They are an additional mapping between sequence measurements of the data streams and activity classes. In [13,16], a single combination of attributes represents an activity class. Nevertheless, this limits the properties of attribute representations. As human actions vary, they could be represented by different combinations of attributes. This paper introduces a novel and large dataset for HAR in the context of logistics. The dataset contains annotations of human actions and their attributes in intra-logistics. This paper explains in detail the recording scenarios, sensor settings, and the annotation process. In addition, it presents the performance of deep architectures for solving HAR on the provided dataset, and it describes an approach for adapting deep architectures to solve HAR using attribute representations. For the dataset, the detailed annotation of these attributes leads to a total of 204 unique attribute representations for the 8 activity classes. This high level of granularity is a prerequisite for evaluating different activity recognition approaches.
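To make the attribute idea concrete, the following sketch encodes activity classes as binary attribute vectors and maps a predicted vector back to the class with the nearest reference representation via Hamming distance. The attribute names and vectors are invented for illustration; the actual dataset uses 19 attributes.

```python
import numpy as np

# Hypothetical one-to-one mapping from activity classes to binary
# attribute vectors (names and vectors are invented, not LARa's).
ATTRIBUTES = ["gait_cycle", "upwards", "handy_unit", "torso_rotation"]
CLASS_TO_ATTRS = {
    "walking":  np.array([1, 0, 0, 0]),
    "handling": np.array([0, 1, 1, 1]),
    "standing": np.array([0, 0, 0, 0]),
}

def decode(prediction, class_to_attrs):
    """Return the class whose reference attribute vector has the
    smallest Hamming distance to the predicted binary vector."""
    best_class, best_dist = None, None
    for cls, ref in class_to_attrs.items():
        dist = int(np.sum(prediction != ref))
        if best_dist is None or dist < best_dist:
            best_class, best_dist = cls, dist
    return best_class
```

Because attributes are shared among classes, a network that predicts them can generalize concepts such as "hand use" across activities instead of learning each class in isolation.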
The LARa dataset contains labeled IMU and OMoCap data, the respective RGB videos, the recording protocol, as well as the annotation and revision tool. All data are freely accessible: https://doi.org/10.5281/zenodo.3862782. The contribution also answers the following research questions in the context of the first freely accessible logistics HAR dataset, the Logistic Activity Recognition Challenge (LARa): What is the state of the art of dataset creation for multichannel time-series HAR? What guidelines are proposed for creating a novel dataset for HAR? What are the properties of a logistics dataset for HAR created by following these guidelines? How does a tCNN perform on this dataset using a softmax compared to an attribute representation? This contribution is organized as follows. Section 2 presents the related work on multichannel time-series HAR. In Section 3, the freely accessible dataset LARa is introduced. First, the data recording steps in the logistics scenarios are presented. Second, the activity classes and semantic attributes are explained. Third, findings of the annotation and revision process are highlighted. Section 3 concludes with an overview of the LARa dataset. Section 4 presents an example of solving HAR on the LARa dataset using deep architectures. Finally, Section 5 offers a discussion and the conclusions of this contribution. Additionally, Appendix A gives an overview of state-of-the-art datasets for HAR. Based on the datasets' descriptions, the guideline for creating the novel dataset in Section 3 is derived.

2. Related Work

Methods of supervised statistical pattern recognition have been used successfully for HAR. The standard pipeline consists of preprocessing, segmentation, statistical feature extraction, and classification. High- and low-pass filters are common preprocessing steps. Low-pass filters serve denoising, as faulty sensor measurements lie in the high-frequency spectrum, whereas changes in human motion are rather low-frequency. Low-pass operations are also used for separating gravitation and inclination of the IMUs in a constant frame of reference, i.e., the earth [19]. A segmentation approach, e.g., a sliding window, divides the input signal into segments of a certain time duration. Statistical features are computed from the time and frequency domain, for example, the mean, variance, channel correlation, entropy, energy, and coherence [1,11,20,21]. The authors in [10] present a summary of such features. Using these features, the parameters of a classifier are computed. The classifier assigns an activity label to an unknown input. Some examples of classifiers are Naïve Bayes, Support Vector Machines (SVMs), Random Forests, Dynamic Time Warping (DTW), and Hidden Markov Models (HMMs) [11,22]. These methods, however, might show low performance on challenging HAR problems. In addition, different combinations of features must be selected manually per activity. This makes the method hardly scalable and prone to overfitting [3,19]. The authors in [11] evaluate HAR for order picking using statistical pattern recognition. They present a novel dataset of human order-picking activities, using a low number of sensor devices. Specifically, they deployed three inertial measurement units (IMUs), worn by workers in two different scenarios. They computed handcrafted statistical features on segments extracted with the sliding-window approach. The authors evaluated three classifiers, namely an SVM, a Naïve Bayes, and a Random Forest.
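The classical pipeline described above can be sketched as follows; this is a minimal NumPy illustration in which a nearest-centroid classifier stands in for an SVM or Random Forest, and the window size, stride, and feature set are illustrative choices only.

```python
import numpy as np

def sliding_windows(signal, win, stride):
    """Segment a [T, C] multichannel signal into overlapping windows."""
    return np.stack([signal[s:s + win]
                     for s in range(0, len(signal) - win + 1, stride)])

def statistical_features(window):
    """Per-channel mean, variance, and signal energy of one [win, C] segment."""
    return np.concatenate([window.mean(axis=0),
                           window.var(axis=0),
                           (window ** 2).mean(axis=0)])

def fit_centroids(features, labels):
    """Nearest-centroid classifier: one mean feature vector per class."""
    return {y: features[labels == y].mean(axis=0) for y in np.unique(labels)}

def predict(centroids, feat):
    """Assign the class whose centroid is closest in feature space."""
    return min(centroids, key=lambda y: np.linalg.norm(feat - centroids[y]))
```

A real system would replace the centroid step with one of the classifiers named above, but the segmentation and feature-extraction stages are structurally the same.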
The authors in [19] solve HAR for activities of daily living. They compute statistical features on three data streams, namely the raw inertial measurements and their AC and DC components. They propose a hierarchical approach that bags the outputs of simple classifiers over different combinations of device locations on the human body. Deep architectures have also been deployed for solving HAR. Temporal Convolutional Neural Networks (tCNNs), Recurrent Neural Networks (RNNs), e.g., Long Short-Term Memory networks (LSTMs), and combinations of both are examples of architectures in the field. tCNNs are hierarchical architectures that combine feature extraction along time and classification in an end-to-end approach. They learn the features and the parameters of the classifier directly from raw data. tCNNs are presented in [17,18,23]. They are composed of convolution and pooling operations that are carried out along the time axis. tCNNs exploit their hierarchical composition, becoming more discriminative with respect to human actions. The combination of stacked convolutional and pooling layers finds temporal relations that are invariant to temporal translation. They are also robust against noise. Moreover, these architectures share small temporal filters among all the sensors in the IMUs, as local temporal neighborhoods are likely to be correlated independently of the sensor type. The authors in [2] introduce an architecture that combines temporal convolutions with LSTM layers replacing the fully connected layers. LSTMs are recurrent units with memory cells and a gating system, which are suitable for learning long temporal dependencies in sequences. These units do not suffer from exploding or vanishing gradients during training. The authors in [24] utilize a shallow recurrent network, namely a three-layered LSTM and a one-layered bidirectional LSTM (BLSTM). BLSTMs process sequences in both the forward and backward directions.
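The core tCNN operations, a temporal convolution followed by a non-linearity and pooling along the time axis, can be sketched in NumPy as below. The filter shapes and the ReLU choice are illustrative; real architectures stack several such layers and end in a classification layer.

```python
import numpy as np

def temporal_conv(x, kernels, bias):
    """Valid 1D convolution along the time axis of a [T, C] segment.
    kernels: [F, K, C], i.e., F filters of temporal length K applied
    across all C sensor channels; returns a ReLU-activated
    [T - K + 1, F] feature map."""
    T, C = x.shape
    F, K, _ = kernels.shape
    out = np.empty((T - K + 1, F))
    for t in range(T - K + 1):
        patch = x[t:t + K]                                  # [K, C]
        out[t] = np.tensordot(kernels, patch,
                              axes=([1, 2], [0, 1])) + bias  # [F]
    return np.maximum(out, 0.0)                              # ReLU

def max_pool(x, size=2):
    """Non-overlapping max pooling along the time axis."""
    T = (len(x) // size) * size
    return x[:T].reshape(-1, size, x.shape[1]).max(axis=1)
```

Pooling halves the temporal resolution, which is what makes the extracted temporal relations invariant to small temporal translations.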
The BLSTMs outperform the convolutional architectures. Nevertheless, tCNNs show more robust behavior against parameter changes. The authors in [3] propose a tCNN adapted for IMUs, called IMU-CNN. The architecture is composed of convolutional branches corresponding to each IMU. These branches compute an intermediate representation per IMU, which are then combined in the last fully connected layers. The authors compared the IMU-CNN with the tCNN and a tCNN-LSTM similar to [2]. The IMU-CNN shows a better performance, as it is more robust against IMU faults and asynchronous data. The authors in [20] investigate the effect of data normalization on a deep architecture's performance. They compare normalization to zero mean and unit standard deviation, batch normalization, and a pressure-mean subtraction. The architecture's performance improves when utilizing normalization techniques. Extending the work of [3], the authors use four sensor-fusion strategies and find that late fusion strategies are beneficial. Additionally, they evaluate the robustness of the architectures with respect to the proportion of the training dataset. The authors in [16] propose using an attribute-based representation for HAR. In object recognition and word-spotting problems, attributes are semantic descriptions of objects or words. They coarsely represent a class. In [12], a search for attributes is presented, as there are no datasets with such annotations. The selected attributes are better suited for solving HAR. For this search, the authors deploy an evolutionary algorithm. First, they assign random binary representations to action classes as the population. Second, they evaluate a population using deep architectures with a sigmoid activation function. The validation performance serves as the evolutionary fitness. The authors deploy non-local mutations on the populations. They conclude that using attribute representations boosts the performance of HAR.
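A highly simplified sketch of such an evolutionary attribute search follows. Here a toy fitness function replaces the expensive training-and-validation of a deep network with sigmoid outputs, and the population size, mutation strength, and number of generations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def evolve_attributes(n_classes, n_attrs, fitness, generations=30, pop=8):
    """Evolutionary search for a class-to-attribute matrix [n_classes, n_attrs].
    `fitness` stands in for the validation performance of a network
    trained on the candidate representation."""
    population = [rng.integers(0, 2, (n_classes, n_attrs)) for _ in range(pop)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = []
        for _ in range(pop):
            # non-local mutation: flip several random bits of the current best
            child = best.copy()
            idx = rng.integers(0, child.size, size=3)
            child.flat[idx] ^= 1
            children.append(child)
        candidate = max(children, key=fitness)
        if fitness(candidate) >= fitness(best):
            best = candidate
    return best

def toy_fitness(rep):
    """Toy surrogate: reward representations with mutually distinct rows."""
    return len({tuple(row) for row in rep})
```

In the cited work, each fitness evaluation requires training a network, which is why the search is expensive and why semantically defined attributes, as in LARa, are preferable when available.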
Even a random attribute representation performs comparably to directly classifying human actions. A drawback of this approach was the lack of a semantic definition of the attributes. Attribute-based representations have been explored in depth for HAR in [13]. Particularly in the manual order-picking process, attribute representations were expected to be beneficial for dealing with the versatility of activities. That contribution compared the performance of deep architectures trained using different attribute representations, and it evaluated their quantitative performance as well as their quality from the perspective of practical application. Expert-given attribute representations performed better than a random one created following the conclusions in [16]. A semantic relation between attributes and activities enhances HAR not only quantitatively with regard to performance, but it also enables a transfer of the attributes between activities by domain experts. In this preliminary work, the mapping between activity classes and attribute representations was one-to-one. This reduces the task to a multiclass problem and limits the benefits of attribute-based representations. An important element of these supervised methods is annotated data [20]. A drawback of using deep methods is the need for extensive annotated data, in contrast to statistical pattern recognition. However, capturing and annotating data for HAR is laborious and expensive. Moreover, annotations of attributes do not yet exist, and these fine-grained annotations represent an extra cost. In [13], human actions were given unique attribute representations. Nevertheless, human actions might include different combinations of attributes. Different combinations of attributes might be helpful for zero-shot learning and for reducing the effects of class imbalance. They also might allow clustering signals of a certain activity that show slight changes in the human movements.
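Decoding a network's sigmoid outputs under a many-to-one attribute table could look as follows. The table entries are invented for illustration; LARa itself has 204 unique attribute representations for its 8 classes.

```python
# Hypothetical many-to-one table: several attribute combinations map to
# the same class, since one activity may be performed in different ways.
REPRESENTATIONS = {
    (1, 0, 1, 0): "handling",   # e.g., right hand, no cart (invented)
    (0, 1, 1, 0): "handling",   # e.g., left hand, no cart (invented)
    (1, 0, 0, 1): "cart",
    (0, 0, 0, 0): "standing",
}

def decode_sigmoid(scores, table, threshold=0.5):
    """Threshold per-attribute sigmoid scores to a binary vector, then
    return the class of the nearest known attribute combination
    (exact match first, otherwise minimum Hamming distance)."""
    binary = tuple(int(s > threshold) for s in scores)
    if binary in table:
        return table[binary]
    nearest = min(table,
                  key=lambda ref: sum(a != b for a, b in zip(binary, ref)))
    return table[nearest]
```

Because several vectors map to one class, variations of the same activity land on the same label, while the binary vectors themselves remain available for clustering or zero-shot transfer.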
So far, there is no large-scale, freely accessible dataset of human activities in complex industrial processes, nor one using attributes. In addition, there are no standard guidelines for creating such a dataset; thus, guidelines need to be defined beforehand. A review of existing datasets and their shortcomings with regard to the goal of this paper is presented in the appendix to further motivate the introduction of the new dataset in the following section.

3. Introducing the LARa Dataset

This section states the LARa dataset’s specifications. Requirements and specifications of LARa are based on a detailed review of datasets for HAR, see Appendix A. In particular, the origin of the laboratory set-ups, the subjects’ characteristics as well as the recording and annotation procedure are showcased. For data recording, the researchers created physical replicas of real-world warehouses in a laboratory. They are called scenarios in this contribution. This subsection gives insights into the replicas’ creation, and it explains the underlying warehousing processes. Next, the sensors’ configuration and the proper preparation of the subjects are presented.

3.1. Guidelines for Creating and Publishing a Dataset

The datasets discussed in Appendix A show no uniform guidelines for dataset creation. Based on this overview of the datasets and their descriptions, a guideline for the creation of a dataset is derived. If possible, the recording should take place under real conditions. Realistic environments, e.g., a real warehouse or a detailed replica, ensure recording natural movements. A replica requires a large laboratory. In addition, objects similar to those in the real scenario are needed, e.g., a picking card. The subjects' selection depends on the variety of people in the real environment, e.g., employees of a real warehouse. The selection criteria involve age, sex, height, and handedness. In addition to the realistic environment, the behavior of the subjects should be as natural as possible. Instead of recording individual activities in isolation, recording a whole process enables natural behavior and thus natural movements. A recording should therefore not only consist of one activity, e.g., lifting a box, but should capture the activity as part of a complete process. Using a recording protocol and an RGB camera for documentation, discrepancies such as the slipping of sensors or markers remain noticeable after the recordings. It is recommended to use different sensor types with a high frame rate. Since there is no uniform positioning of sensors, several sets of different positions on the human body can be experimented with. OMoCap and RGB videos can help in complex annotation scenarios. The annotation is to be carried out by domain experts such as physiotherapists, dance teachers, or, in the case of logistics, logistics experts. As soon as several people annotate or are expected to benefit from the annotated data, an annotation guideline is necessary. A revision of the annotation is recommended to improve the quality of the labeled data. To support other applications, the representation of the activity classes should be as granular as possible.
The granularity depends on the number of activities and can be increased by binary coarse-semantic descriptions. Necessary general information, such as the location and period of the recordings, must be specified. The method of data acquisition and the description of the activities are part of the dataset's description. In addition to the method of annotation and its effort, the labeled activity classes must be described. The dataset should contain labeled and raw data from all sensors. Access to the annotation tool must be provided so that the annotation process can be understood.
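As a sketch, the descriptive metadata demanded by this guideline could be captured in a small record. The field names are our own invention, and the values are those reported for LARa in this paper; the recording period is deliberately left unspecified, as it is not stated in this excerpt.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetCard:
    """Descriptive metadata a HAR dataset publication should specify
    (field names are illustrative, not a standard schema)."""
    name: str
    location: str
    recording_period: str          # must be specified by the publishers
    sensors: list = field(default_factory=list)
    activity_classes: int = 0
    attributes: int = 0
    recording_minutes: int = 0
    annotation_person_hours: int = 0
    annotation_tool_included: bool = False

# values reported for LARa in this contribution
lara = DatasetCard(
    name="LARa",
    location="Innovationlab Hybrid Services in Logistics, "
             "TU Dortmund University",
    recording_period="",  # left empty; not stated in this excerpt
    sensors=["OMoCap", "IMU (MbientLab)", "RGB camera"],
    activity_classes=8,
    attributes=19,
    recording_minutes=758,
    annotation_person_hours=474,
    annotation_tool_included=True,
)
```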

3.2. Laboratory Set-Ups based on Logistics Scenarios

This subsection explains three logistics scenarios for data recording. The warehousing processes’ graphical representation is based on the guidelines defined by the Object Management Group [25]. The graphical and textual descriptions of the scenario guide researchers when applying methods of HAR that take context into consideration. A detailed explanation of the scenarios might be helpful for approaches involving context, preconditions and effects, e.g., Computational Causal Behavior Models (CCBM) [26]. This context may be the constraints of the warehousing process. For example, some activities can only be performed in a specific order or at a specific location and time. Data were recorded in physical set-ups created in a controlled environment—the ’Innovationlab Hybrid Services in Logistics’ at TU Dortmund University [15]. A group of researchers created the physical replica of warehousing scenarios following a cardboard engineering approach [27,28].

3.2.1. Logistics Scenario 1—Simplified Order Picking System

The first scenario is not based on a real warehouse. Nevertheless, such a process may exist in reality. The process is illustrated in Figure 1; the physical laboratory set-up is presented in Figure 2.
Figure 1

Business process model of logistics Scenario 1—simplified order picking.

Figure 2

Physical laboratory set-up of logistics Scenario 1—simplified order picking.

At the beginning of an order-picking process, the subject places boxes on an empty order-picking cart. These empty boxes are provided at the base. In a real warehouse, this base may be a conveyor that transports empty boxes to the order picker while transporting full boxes to the shipping area. In the laboratory, stacking frames recreated the conveyor. This simplification does not influence human-motion behavior. The boxes and the cart were standard items that are common in industry. Next, the subject moves the cart to a retrieval location. The researchers who guided the recordings specified where to go. An order-picking aisle was recreated by placing boxes on frames. When the subject arrives at a retrieval location, they pick articles from a box or they open a fronted bin. The subjects place the articles in an empty box on their cart. The articles were small, light items, such as bags of 500 g. This procedure of taking the cart to a new location and retrieving goods is repeated until all boxes on the cart are full. The subject takes the cart back to the base and places the full boxes on the conveyor. The order-picking process then starts anew. When all boxes in the aisles are empty, the order-picking process ends and the research team refills the boxes.

3.2.2. Logistics Scenario 2—Real-World Order Picking and Consolidation System

The second scenario is based on a real warehouse. Access to the site and process documentation was granted by industry partners of the chair of Materials Handling and Warehousing. In contrast to Scenario 1, the second scenario takes information technology processes such as scanning barcode labels or pushing buttons for pick confirmation into account. For the sake of clarity, the order-picking process and the consolidation process of the picked goods are illustrated separately in Figure 3 and Figure 4, respectively. The physical laboratory set-up of Scenario 2 is illustrated in Figure 5.
Figure 3

Business process model of logistics Scenario 2 (Part 1)—real warehouse order picking.

Figure 4

Business process model of logistics Scenario 2 (Part 2)—real warehouse order picking.

Figure 5

Physical laboratory set-up of logistics Scenario 2—real warehouse order picking.

The order-picking cart is bigger than the one used in the first scenario, as visible in Figure 2 and Figure 5. It has three shelves of equal size that are filled with cardboard boxes of different shapes and sizes. Each box is held open with a rubber band. In the real warehouse, a so-called put-to-light (PtL) frame is attached to the cart. It gives a visual signal where to place articles and has buttons to press for retrieval and submission confirmation. Small calculators are attached to the cart to replicate this system in the laboratory. On its shorter end, the cart has two handles, a small screen, a stamp pad, a plastic bag for packaging waste, and a second bag, which is filled with more small plastic bags. Apart from the screens, all items could be purchased. A labeled cardboard replica substitutes the screen. The research group gives the subject the information that is usually displayed on the screens, for example, the retrieval location or the picking quantity. Subjects deploy a stamp and a knife, which are attached to the OMoCap suit. Additionally, subjects operate a handheld scanner, which is attachable to the cart. To ensure a natural motion of the subjects when using the scanner, all items have barcode labels that need to be scanned. Thus, the subjects have to use the scanner correctly to trigger an acoustic signal that confirms a scan operation. An order consists of several items that need to be picked in varying quantities. During each order-picking cycle, one cart serves the orders of several customers at the same time. This is referred to as order batching. The articles are household goods of varying dimensions and weights, such as cutlery, dishes, or storage boxes. They are stored in plastic and cardboard boxes and open-lid bins. Some of the cardboard boxes were sealed with tape to protect the goods. These storage units are placed on shelves of different heights or on the ground. Stacking frames and shelves formed two aisles.
In the real-world system, a flow-through rack is deployed for goods consolidation. In the laboratory, pipe-racking systems were used to recreate it. Each chute of the flow-through rack is equipped with a barcode label and a human-readable ID. In general, the subject scans all labeled units to ensure that the correct article is picked, e.g., a single article or a newly labeled plastic bag. There are three cases for scanning an article's barcode label. In the first case, the articles are individually packed, and every article already has a barcode label attached. Second, some articles are in secondary packaging, e.g., a cardboard box or a plastic bag, that needs to be opened before retrieval. The articles in this secondary packaging have individual barcodes. Third, some articles do not have an attached barcode label. In this case, the barcode at the shelf has to be scanned. A barcode label roll is provided next to the respective articles, and these labels need to be attached to the retrieval unit. To begin the order-picking process, the subjects scan the barcode of the cart to trigger the order-picking mode. The screen shows the next retrieval location. When they arrive there, they scan the article's barcode label, which may be found on the article or on the shelf, as explained previously. If the article is correct, the screen indicates the correct withdrawal quantity. Next, the subject retrieves the correct number of articles. If necessary, they open sealed cardboard boxes with the knife. They dispose of packaging waste using the plastic bag at the cart. If the article already has a barcode label, the subject can scan it so that the PtL frame visually indicates the correct box for submitting the articles. For articles that do not have a barcode label, the subject wraps the desired quantity of articles in a plastic bag and seals it with a barcode label provided at the shelf. Pressing a button confirms each submission into a box on the cart.
The button is on the PtL frame above the box. If this is the first item in a box, the box must be marked with a stamp. This is a quality-assurance measure to trace back the employee who packed the box. The subject takes the cart from one retrieval location to the next until the order is complete. The order-picking process is followed by the consolidation of the packed goods in preparation for dispatch. For consolidation, the boxes must be inserted at the back side of a flow-through rack. On the front side, the packaging workplaces are located, where dispatch preparation takes place. As with the order-picking mode, the subject scans a specific barcode on the cart to trigger the consolidation mode. Next, they take the cart to the consolidation point, which is shown on the cart's display. The subjects scan the barcode of a box so that the scanner's display shows the correct chute. After they insert the box, they scan the barcode label at the chute to confirm the submission. This procedure repeats until there are no more boxes on the cart.

3.2.3. Logistics Scenario 3—Real-World Packaging Process

The third scenario is the packaging process that follows the order picking and consolidation of Scenario 2 in the same real-world warehouse. The packaging process serves the dispatch preparation of the picked articles. In general, the consignment size per order does not exceed 5 boxes; thus, shipping by pallet is not feasible. The real-world packaging process is illustrated in Figure 6. Its physical laboratory set-up can be observed in Figure 7.
Figure 6

Business process model of logistics Scenario 3—real warehouse packaging work station.

Figure 7

Physical laboratory set-up of logistics Scenario 3—real warehouse packaging work station.

Each packaging workplace is equipped with a computer, a printer, a bubble wrap dispenser, a tape dispenser, a scale for weighing boxes, and a trash bin. Next to the table is a conveyor, where all boxes that are ready for shipment have to be placed. The packaging table in the laboratory is a model often found in real warehouses. Further tables were placed next to the packaging table to provide space for the equipment. The table on the far left was used to recreate the surface of the conveyor. When a box was pushed onto the surface, a researcher took the box. The actual motion of a conveyor is not necessary to ensure a human motion that is close to reality. The dimensions of the tables in the laboratory closely resemble the table from the real-world warehouse. For the tools, equipment similar to the real-world system was purchased. The bubble wrap dispenser was recreated by cutting a small opening in a cardboard box. The wrap was refilled manually by the researchers present during the recordings. A fully functional computer was placed on the table, with a mouse and keyboard attached and a spreadsheet application running. When computer work was necessary, the subjects were tasked to perform basic tasks in the program. The printers were substituted by a researcher handing the printed items to the subject. As the weight scale is an area on the table's surface, it could be recreated by indicating a certain area with colored stripes. As explained previously, all boxes to be prepared for shipment were stored in a flow-through rack. During the recordings, the rack from Scenario 2 was used and moved next to the packaging table. As the second and third scenarios were recorded in immediate succession, the flow-through rack was already filled with boxes containing articles. At the beginning of the packaging process, the subject goes to the computer and chooses a packing order.
Next, they take all boxes that belong to one order from the flow-through rack and place them on the packing table. The rubber band of each box is removed and the barcode is scanned with the hand scanner; when doing so, the packing list of the order is printed automatically. For each box, the subject evaluates its filling level to decide whether repacking is necessary. This is the case when the box is either rather empty or overfull. In the first case, more articles from a different box of the same order are added; when the filling level is low, the contents of several boxes are combined, and when a box is removed from the order, this information must be entered into the computer. In the second case, the articles protrude from the box, possibly because the articles are bigger than the box due to incorrect article master-data; the contents of an overfilled box are put into a bigger one. The subject can get boxes of different sizes from a storage next to the packing table. When repacking articles from one box to another, each article needs to be scanned and the repacking must be confirmed at the computer. The subject then confirms that all boxes of an order are filled properly. In case the packing list has been altered due to repacking, it is reprinted automatically. Next, the subject puts the packing list into each box and fills the boxes up with bubble wrap. Then, each box must be pushed onto the scale, and the subject triggers the weighing process at the computer. The system checks whether the actual weight of the box corresponds to the expected weight according to the master data and the packing list. Once all boxes are packed correctly and their weight has been approved, the subject seals them using the tape dispenser. The printer automatically prints the shipping labels when all boxes of one order are ready to be sealed, and the subject applies a label to each box. Eventually, each box is pushed onto the conveyor surface.

3.3. Configuration of Sensors and Markers

The OMoCap system tracked 39 reflective markers attached to a suit, see Figure 8. The VICON system consists of 38 infrared cameras recording at a sampling rate of 200 fps. Three different sets of on-body devices, or IMUs, were used, see Figure 9. IMU-sets 1 and 3 served as a proof of concept and are not part of the dataset. The six IMUs of the second set, from MbientLab [29], are attached to the arms, legs, chest, and waist. They record tri-axial linear acceleration and angular velocity at a rate of 100 Hz.
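Because the OMoCap system records at 200 fps while the MbientLab IMUs record at 100 Hz, the two streams must be brought onto a common time axis before labels can be transferred between them. A minimal sketch using linear interpolation (the resampling approach itself is an assumption for illustration, not a procedure prescribed by the dataset):

```python
import numpy as np

def resample_to_rate(signal, src_rate, dst_rate):
    """Linearly interpolate a (T, D) multichannel signal from src_rate to dst_rate."""
    signal = np.asarray(signal, dtype=float)
    t_src = np.arange(signal.shape[0]) / src_rate        # original time stamps [s]
    n_out = int(round(t_src[-1] * dst_rate)) + 1         # samples covering the same duration
    t_dst = np.arange(n_out) / dst_rate                  # target time stamps [s]
    return np.stack([np.interp(t_dst, t_src, signal[:, d])
                     for d in range(signal.shape[1])], axis=1)

# Upsample a 2-minute, 6-channel IMU stream from 100 Hz to the 200 fps OMoCap rate.
imu = np.random.randn(100 * 120, 6)
imu_200 = resample_to_rate(imu, 100, 200)
```

Any other alignment scheme (e.g., nearest-neighbor sampling of labels) would serve equally well; linear interpolation simply keeps the channel values smooth.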
Figure 8

Marker positions on an Optical marker-based Motion Capture (OMoCap) suit.

Figure 9

Positions of on-body devices (inertial measurement unit (IMU)) from set 1 (Texas Instruments Incorporated), set 2 (MbientLab), and set 3 (MotionMiners GmbH).

3.4. Characteristics of Participating Subjects

A total of 14 subjects (S) were involved in the recording process. Their characteristics, including sex, age, weight, height, and handedness, are listed in Table 1. Examining the minimum and maximum of these characteristics shows that a wide spectrum of physical characteristics is present; thus, the subjects' motion patterns vary widely. In addition, the ratio of left-handed to right-handed subjects closely resembles that of the general population [30,31].
Table 1

Subject: specifications and scenario assignment.

ID   | Sex [F/M] | Age [year] | Weight [kg] | Height [cm] | Hand. [L/R] | OMoCap | IMU-Set | Scen. 1 | Scen. 2 | Scen. 3
S01  | M | 28 | 78  | 175 | L | x | 1    | 29 | 0  | 0
S02  | F | 24 | 62  | 163 | L | x | 1    | 30 | 0  | 0
S03  | M | 59 | 71  | 171 | R | x | 1    | 27 | 0  | 0
S04  | F | 53 | 64  | 165 | L | x | 1    | 29 | 0  | 0
S05  | M | 28 | 79  | 185 | R | x | 1    | 26 | 0  | 0
S06  | F | 22 | 52  | 163 | R | x | 1    | 30 | 0  | 0
S07  | M | 23 | 65  | 177 | R | x | 2, 3 | 2  | 13 | 14
S08  | F | 51 | 68  | 168 | R | x | 2, 3 | 2  | 13 | 14
S09  | M | 35 | 100 | 172 | R | x | 2, 3 | 2  | 14 | 13
S10  | M | 49 | 97  | 181 | R | x | 2, 3 | 2  | 13 | 12
S11  | F | 47 | 66  | 175 | R | x | 2, 3 | 2  | 12 | 0
S12  | F | 23 | 48  | 163 | R | x | 2, 3 | 0  | 6  | 14
S13  | F | 25 | 54  | 163 | R | x | 2, 3 | 2  | 14 | 14
S14  | M | 54 | 90  | 177 | R | x | 2, 3 | 2  | 14 | 14
Min. |   | 22 | 48  | 163 |
Avg. |   | 37 | 71  | 171 |
Max. |   | 59 | 100 | 185 |
Sum  |   |    |     |     |   |   |      | 185 | 99 | 95
The scenario columns give the number of two-minute recordings per subject.
All subjects were scheduled for a total of 30 recordings of 2 min each, which corresponds to about 30 × 2 × 14 = 840 min of recorded material. In Scenario 1, subjects 1 to 6 performed 30 recordings wearing the OMoCap suit and IMU-set 1. Subjects 7 to 14, wearing the OMoCap suit and IMU-sets 2 and 3, participated in 2 recordings in Scenario 1, 14 recordings in Scenario 2, and 14 packing recordings in Scenario 3. Due to heavy noise and issues with the sensor readings, some recordings had to be scrapped and are not included in the dataset; thus, the number of recordings per subject in Table 1 deviates from this schedule. A total of 379 recordings (758 min) were annotated and are included in the dataset. Figure 10 shows the varying physical features of all subjects true-to-scale.
Figure 10

Subjects before the recordings.

3.5. Recording Procedure

The LARa dataset was recorded in 7 sessions. In the first 3 sessions, subjects 1 to 6 went through Scenario 1. In sessions 4 to 7, data were recorded in all three scenarios with subjects 7 to 14.

3.5.1. Preliminaries

Before the recording, each subject was measured according to the information necessary for the VICON Nexus software: body mass, height, leg length, knee width, ankle width, shoulder offset, elbow width, wrist width, and hand thickness. Subsequently, the test subjects were equipped with an OMoCap suit, a headband, and work safety shoes, as used in real warehouses. Markers and IMUs were attached to the suit. To document the proper positioning of all markers and IMUs, each subject was photographed from four sides before the recording.

3.5.2. Recording Process

To record realistic motions, the subjects were introduced to the scenarios by a domain expert in advance. Test runs were carried out before the recordings commenced, and the subjects were allowed to familiarize themselves with the processes and objects. The subjects do not perform individual, isolated movements as in other datasets that originate from laboratories, e.g., [32]; rather, realistic motion sequences were the goal. To achieve this, the subjects were only instructed about their tasks within a scenario. They were not told how to perform the specific motions necessary to fulfill their task. Thus, the way they handled items, picked boxes, and moved to a location was not influenced by the researchers; the motion is solely determined by each subject's individual preference. In addition, the subjects were not given detailed information about the underlying research goal to avoid a bias in their motion behavior. Between each recording unit of two minutes, only a break of a few seconds was necessary to start the next capture, so the subjects could remain focused on the task. The subjects did not take off the suit between recordings. After the recordings concluded, each subject was photographed again from four sides to verify the proper positioning of the markers and sensors.

3.5.3. Documentation and Protocol

A protocol was kept before, during, and after the recordings for each subject to ensure repeatability of the recording sessions: time, size of the suit and shoes, room temperature, the use of velcro to fit the suit to the person, the RGB video files that were created, the number and descriptions of photos taken, remarks, and incidents. The expenditure of time for recording is made up of the OMoCap system's calibration, the preparation of each subject, their introduction to the scenarios, and the recordings themselves. In total, the expenditure was over 197 person-hours (PHR) to record 14 h of material. To support the subsequent annotation of the data, the sessions were captured by an RGB camera. In sessions 1 to 3, only occasional recordings were created with the RGB camera, but at least one recording per subject is available. Due to the increasing complexity of human motion and the increasing spectrum of objects in Scenarios 2 and 3, the subjects 7 to 14 were captured entirely by a camera to ensure that the performed activities are apparent to the annotators. In addition to the 14 subjects, the RGB camera recorded other people who were in the test field at the same time. They provided guidance when the task was unclear, ensured that none of the markers or sensors detached, and continuously maintained the experimental setup, e.g., by refilling the shelves with packed goods. In addition, photos taken before and after the recordings are included in the protocol. The Remarks section of the protocol includes the number and time of the breaks taken by the subjects, re-calibrations of the OMoCap system during the session, injuries of the subjects, and unusual movements during recording, e.g., drinking. Incidents mainly include lost or shifted markers and sensors. If a loss was observed during a recording, the recording was aborted, deleted, and restarted from the beginning.
In three instances, a detachment was noticed only after the recording session. Incidents with respect to S11: after recording 27, it was noticed that the marker of the left finger (see Figure 8, marker number 22) was misplaced; the research group could not determine when exactly the marker shifted its position. After recording 30, it was noticed that the marker of the right ankle (see Figure 8, marker number 35) was lost. Incidents with respect to S13: after the last recording (number 30), it was noticed that the marker of the right finger (see Figure 8, marker number 23) and the marker of the left wrist (see Figure 8, marker number 18) were missing; one of the lost markers was found on the left side of the subject's chest. Incidents with respect to S14: after recording number 15, it was noticed that the marker of the right forearm (see Figure 8, marker number 17) was stuck to the leg. For the subsequent recordings (numbers 16 to 30), the marker was put back in its proper position. Despite these incidents, the data acquired through these recordings were found to be usable.

3.6. Classes and Attributes

This subsection explains the definitions of human activities in the dataset. The dataset considers periodic and static activities, following [1]. The dataset contains annotations of semantic coarse-descriptions of the activities. These semantic definitions are called attributes, and they are motivated by the HAR methods in [13,16]. An attribute representation can be seen as an intermediate binary mapping between sequential data and human activities. This intermediate mapping is beneficial for solving HAR problems because it allows sharing high-level concepts among the activity classes, so the consequences of the unbalanced-class problem can be reduced. A dataset for HAR contains a set of N sequential samples X_n ∈ R^(T×D) — for the LARa dataset either the OMoCap or the IMU recordings — where T is the sequence length and D is the number of joints or sensors per dimension, also addressed as the number of sequence channels, together with their respective activity classes y_n from a set of C activity classes. Following the method in [16], this dataset additionally provides attribute annotations a_n, where each a_n is drawn from an attribute representation A. A is a binary attribute representation of size M × K, with M the number of attribute representations of size K over all activity classes. A single attribute representation a serves as an intermediate representation between an input signal X and the expected activity class y, i.e., X → a → y. There are M different attribute representations. This is different from [16], where the authors assign a single, random attribute representation to an activity class. In this work, the number M of representations is determined after the annotation process, in which a set of attributes is assigned to short windows of the recordings according to the human movements. Table 7 shows the number of different attribute representations per activity class in LARa. The definition of the activities and their semantic attributes is derived from the researchers' experience [13,33] and from HAR methods [1,16].
The terminology of the attributes and activities by default implies an industrial context. This excludes activities irrelevant for warehousing, such as smoking or preparing coffee. This is referred to as a Closed-World Condition [34].

3.6.1. Activity Classes

There are eight activity classes, see Table 2. Standing, Walking and Cart emphasize the subject’s locomotion. The Handling activities refer to a motion of the arms and hands when manipulating an article, box, or tool. These activities do not consider holding an element while standing or walking. Synchronization is crucial for proper annotation and for transferring the labels to different sensor streams.
Table 2

Activity Classes and their semantic meaning.

Activity Class | Description
c1 Standing | The subject is standing still on the ground or performs smaller steps. The subject can hold something in their hands or stand hands-free.
c2 Walking | The subject performs a gait cycle [35] (pp. 3–7) while carrying something, or the subject is walking hands-free. The only exception is made with regard to the cart (see below).
c3 Cart | The subject is walking (gait cycle) with the cart to a new position. This class does not include the handling of items on the cart, like putting boxes or retrieving items. Likewise, the handling of the cart itself, e.g., turning it to better reach its handles, is not included.
c4 Handling (upwards) | At least one hand reaches shoulder height (80% of a person's total height [36] (p. 146)) or is lifted beyond that during the handling activity.
c5 Handling (centred) | Handling is possible without bending over, kneeling, or lifting the arms to shoulder joint height.
c6 Handling (downwards) | The hands are below the height of the knees (lower than 30% of a person's total height [36] (p. 146)). The subject's spine is horizontal or they are kneeling.
c7 Synchronization | Waving motion where both hands are above the subject's head at the beginning of each recording.
c8 None | Excerpts that shall not be taken into account because the class is not recognisable. Reasons are errors or gaps in the recording or a sudden cut at the end of a recording unit.

3.6.2. Attributes

There are K = 19 attributes. These are coarse semantic descriptions of the activities. They are mostly related to the locomotion and to the pose when moving; the human pose changes according to the handling of different elements and to different heights. The attributes are subdivided into five groups, see Table 3 and Figure 11.
Table 3

Attributes and their semantic meaning.

Attribute | Description
I - Legs
A Gait Cycle | The subject performs a gait cycle [35] (pp. 3–7).
B Step | A single step where the feet leave the ground without a foot swing [35] (pp. 3–7). This can also refer to a step forward followed by a step backwards using the same foot.
C Standing Still | Both feet stay on the ground.
II - Upper Body
A Upwards | At least one hand reaches shoulder height (80% of a person's total height [36] (p. 146)) or is lifted beyond that during the handling activity.
B Centred | Handling is possible without bending over, kneeling, or lifting the arms to shoulder joint height.
C Downwards | The hands are below the height of the knees (lower than 30% of a person's total height [36] (p. 146)). The subject's spine is horizontal or they are kneeling.
D No Intentional Motion | Default value when no intentional motion is performed, e.g., when standing without doing anything, carrying a box, or walking with a cart. This is because there is no intentional upper-body motion when performing these activities, only a steady stance.
E Torso Rotation | Rotation in the transverse plane [37] (pp. 2–3). Either a rotating motion, e.g., when taking something from the cart and turning towards the shelf, or a fixed position when handling something while the torso is rotated.
III - Handedness
A Right Hand | The subject handles or holds something using the right hand.
B Left Hand | The subject handles or holds something using the left hand.
C No Hand | Hands are not used, neither for holding nor for handling something.
IV - Item Pose
A Bulky Unit | Items that the subject cannot put their hands around, e.g., boxes.
B Handy Unit | Items that can be carried with a single hand or that the subjects can put their hands around, e.g., small articles, plastic bags.
C Utility Auxiliary | Use of equipment, e.g., scissors, knives, bubble wrap, stamps, labels, scanners, packaging tape dispensers, adhesives, etc.
D Cart | Either bringing the cart into proper position before taking it to a different location (Handling) or walking with the cart to a new location (No Intentional Motion).
E Computer | Using mouse and keyboard.
F No Item | Activities that do not include any item, e.g., when the subject fumbles for something when searching for a specific item.
V - None
A None | Equivalent to the None class.
Figure 11

Semantic attributes.

During the labeling, annotators follow these rules: at least one attribute per group must be assigned. In group I, the attributes are disjoint, since a subject performs only one of these motions at a time. The attributes A–D of group II are disjoint, while the Torso Rotation attribute is independent of them. In the third group, the choice between right and left is non-exclusive, as one can use both arms at the same time. In group IV, the attributes are disjoint; annotators give priority to the items according to a fixed hierarchy. The None and the Synchronization classes have a fixed attribute representation, as the execution of the waving motion for synchronizing is predefined.
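The annotation rules above can be expressed as a simple consistency check. The sketch below validates the 17-bit layout over groups I–IV used in Table 4 (the bit layout and the omission of group V are our assumptions for illustration):

```python
# Validate a 17-bit attribute vector over groups I-IV, following the annotation
# rules of Section 3.6.2. The exact bit layout is an assumption based on Table 4.
GROUPS = {
    "I Legs":         slice(0, 3),    # A Gait Cycle, B Step, C Standing Still (disjoint)
    "II Upper Body":  slice(3, 8),    # A-D disjoint, E Torso Rotation independent
    "III Handedness": slice(8, 11),   # A Right, B Left, C No Hand
    "IV Item Pose":   slice(11, 17),  # A-F disjoint
}

def is_valid(bits):
    if len(bits) != 17:
        return False
    legs, upper, hands, items = (bits[GROUPS[g]] for g in GROUPS)
    if sum(legs) != 1:            # group I: exactly one locomotion attribute
        return False
    if sum(upper[:4]) != 1:       # group II: A-D disjoint; E (Torso Rotation) is free
        return False
    if sum(items) != 1:           # group IV: disjoint
        return False
    if hands[2] == 1 and (hands[0] or hands[1]):
        return False              # "No Hand" excludes right/left hand
    return sum(hands) >= 1        # at least one attribute per group
```

Right and left hand may co-occur, mirroring the non-exclusive choice in group III.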

3.6.3. Exemplary Activity Sequence and Its Proper Annotation

Table 4 shows an exemplary warehousing process that consists of four process steps. This process is an excerpt from Scenario 2. In the first process step, the subject is initially standing (Act. 1) before walking to the cart without holding anything in hands (Act. 2). Then, the cart is brought into proper position with both hands while performing smaller steps (Act. 3) and the subject pulls the cart to the retrieval location using the right hand (Act. 4).
Table 4

Exemplary picking process broken down into process steps, activities, classes, and attributes.

Attribute representation columns: I Legs (A Gait Cycle, B Step, C Standing Still); II Upper Body (A Upwards, B Centred, C Downwards, D No Intentional Motion, E Torso Rotation); III Handedness (A Right Hand, B Left Hand, C No Hand); IV Item Pose (A Bulky Unit, B Handy Unit, C Utility/Auxiliary, D Cart, E Computer, F No Item).

Process Step                     | Act. | Class            | I ABC | II ABCDE | III ABC | IV ABCDEF
1 Bring cart to retrieval        |  1   | c1 Standing      |  001  |  00010   |   001   |  000001
  location                       |  2   | c2 Walking       |  100  |  00010   |   001   |  000001
                                 |  3   | c5 Hand. (cen.)  |  010  |  01000   |   110   |  000100
                                 |  4   | c3 Cart          |  100  |  00010   |   100   |  000100
2 Scan barcode                   |  5   | c1 Standing      |  001  |  00010   |   100   |  000100
                                 |  6   | c5 Hand. (cen.)  |  010  |  01000   |   010   |  010000
                                 |  7   | c5 Hand. (cen.)  |  001  |  01000   |   010   |  010000
                                 |  8   | c4 Hand. (upw.)  |  001  |  10001   |   010   |  001000
                                 |  9   | c5 Hand. (cen.)  |  010  |  01000   |   010   |  010000
3 Retrieve item and put in box   | 10   | c4 Hand. (upw.)  |  010  |  10000   |   100   |  010000
                                 | 11   | c4 Hand. (upw.)  |  001  |  10000   |   100   |  010000
                                 | 12   | c4 Hand. (upw.)  |  010  |  10000   |   100   |  010000
                                 | 13   | c6 Hand. (down.) |  010  |  00100   |   110   |  010000
                                 | 14   | c6 Hand. (down.) |  001  |  00100   |   110   |  010000
4 Confirm pick                   | 15   | c6 Hand. (down.) |  001  |  00100   |   100   |  001000
At the beginning of process step 2, the subject is standing while resting the right hand on the cart's handle (Act. 5). Then the subject proceeds to take the scanner from the cart. The first half of this left-handed handling motion is performed while taking a step (Act. 6), while the latter half is performed while standing with both feet on the ground (Act. 7). It is important to note that the scanner is annotated as a Handy Unit here because it is handled as such; in contrast, using it in the following activity is annotated with Utility Auxiliary. The label is located on the subject's right at eye level, so a Torso Rotation is necessary and the handling is performed upwards (Act. 8). The ninth activity refers to the subject mounting the scanner back on the cart (Act. 9). In the third process step, the subject picks the item from the shelf (Act. 10–12) and places it in a box located on the lowest level of the cart (Act. 13 and 14). Finally, the pick is confirmed by clicking the put-to-light button located above the box (Act. 15). There is a wide variety of activity sequences that may constitute the same process. For example, different subjects use different hands when handling an element. In addition, their body motions when lifting something from the same height differ depending on their body size. Thus, the exemplary sequence of activities in Table 4, its class labels, and its attribute representations are one of many viable options.
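Each row of Table 4 encodes one activity as a 17-bit attribute vector. A small helper makes such a row human-readable (the column-order mapping to attribute names is our assumption, derived from Tables 3 and 4, not an official mapping shipped with LARa):

```python
# Attribute names in the assumed column order of Table 4 (groups I-IV).
ATTRIBUTES = [
    "Gait Cycle", "Step", "Standing Still",                    # I Legs
    "Upwards", "Centred", "Downwards",
    "No Intentional Motion", "Torso Rotation",                 # II Upper Body
    "Right Hand", "Left Hand", "No Hand",                      # III Handedness
    "Bulky Unit", "Handy Unit", "Utility/Auxiliary",
    "Cart", "Computer", "No Item",                             # IV Item Pose
]

def decode(bitstring):
    """Turn a 17-character bit string from Table 4 into the active attribute names."""
    return [name for name, bit in zip(ATTRIBUTES, bitstring) if bit == "1"]

# Act. 1 of the exemplary process (c1 Standing):
print(decode("00100010001000001"))
# Act. 8 (c4 Handling upwards, scanning the label):
print(decode("00110001010001000"))
```

Decoding Act. 1 yields exactly the attributes discussed in the text: the subject stands still, with no intentional upper-body motion, no hand in use, and no item.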

3.7. Annotation and Revision

A Python tool was created for annotating the OMoCap data, see Figure A3. The procedure of the annotation and revision is described by Reining et al. [38]. The annotation tool offers a visualization of the skeleton from the OMoCap data and a window-based annotation frame. A window is a segment that is extracted from the sequential data. In the annotation process, an annotator provides the activity class and the attribute representation of a window. Window sizes are variable; the annotator consequently selects the size of each window. Twelve annotators labeled the OMoCap data of the 14 subjects. Apart from two annotators, none of them had any prior experience regarding the annotation of OMoCap data. Each annotator followed the guidelines mentioned in Section 3.6. RGB videos served as an additional aid for complex activities.
Figure A3

Screenshot of the annotation and revision tool during the annotation.

The total time effort for annotation comprised over 474 person-hours (PHR). Table 5 illustrates the annotation effort per individual annotator. The information given in the table relates to two-minute recordings. With a range of 39 min to almost 3 h of annotation per recording, the annotators differ greatly in their annotation speed. The reasons for the different annotation speeds are the different levels of experience of the annotators, the different settings of the window sizes of activities, and the individually selectable playback speed of the OMoCap recordings in the annotation tool. On average, about 37.5 min were required to annotate one minute of recorded material.
Table 5

Annotation effort of all annotators.

ID   | Total Time [hh:mm:ss] | No. of Rec. | Time per Rec. [hh:mm:ss]
A01  | 55:12:19  | 52  | 01:14:02
A02  | 73:22:04  | 45  | 01:55:21
A03  | 56:30:39  | 54  | 01:14:13
A04  | 34:39:08  | 26  | 01:28:00
A05  | 84:18:37  | 30  | 02:48:37
A06  | 39:24:16  | 64  | 00:39:46
A07  | 28:40:57  | 25  | 01:10:35
A08  | 32:56:40  | 27  | 01:15:24
A09  | 33:28:45  | 27  | 01:14:24
A10  | 10:14:21  | 12  | 00:51:12
A11  | 23:03:16  | 14  | 01:38:48
A12  | 02:16:00  | 3   | 01:45:03
Min. |           |     | 00:39:46
Max. |           |     | 02:48:37
Sum  | 474:07:02 | 379 |
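The average annotation effort per recorded minute follows directly from the totals in Table 5 (474:07:02 of annotation for 379 two-minute recordings):

```python
# Average annotation effort per recorded minute, from the totals in Table 5.
total_h, total_m, total_s = 474, 7, 2      # summed annotation time, 474:07:02
recorded_minutes = 379 * 2                 # 379 two-minute recordings = 758 min

total_annotation_min = total_h * 60 + total_m + total_s / 60
per_recorded_minute = total_annotation_min / recorded_minutes
print(round(per_recorded_minute, 1))       # annotation minutes per recorded minute
```

This works out to roughly 37.5 min of annotation per minute of material.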
Following the annotation, the data were revised by four domain experts, see Table 6. The revision of an annotated two-minute recording varied between 4 and 121 min, depending on the quality of the annotation. Compared to the annotation, the average time for a revision is significantly lower, at about 11.3 min per minute of recorded material.
Table 6

Revision effort of all revisers.

ID   | Total Time [hh:mm:ss] | No. of Rec. | Time per Rec. [hh:mm:ss]
Re01 | 13:44:00  | 88  | 00:09:22
Re02 | 39:18:00  | 97  | 00:24:19
Re03 | 28:37:00  | 91  | 00:18:52
Re04 | 61:19:00  | 103 | 00:35:43
Min. |           |     | 00:09:22
Max. |           |     | 00:35:43
Sum  | 142:58:00 | 379 |
The dataset is unbalanced. The Handling classes represent nearly 60% of the recordings. These classes also show a higher variability of their attribute representations; this means that these classes show up in many different forms. The class Handling (centred) is the most frequent activity by far. The representations of the Walking activity class differ only with regard to the handedness and the item pose, because the Gait Cycle and the No Intentional Motion attributes are fixed. The third class, Cart, can only have three representations: the cart is pushed or pulled using the Left Hand, the Right Hand, or both hands while walking. By definition, there is only one valid representation each for the Synchronization and None classes. This is reflected in the results of the annotation and revision, see Table 7.
Table 7

Annotation results divided by activity classes.

Class              | Stand.  | Walk.   | Cart      | Hand. (up.) | Hand. (cent.) | Hand. (down.) | Sync.   | None
Samples            | 974,611 | 994,880 | 1,185,788 | 754,807     | 3,901,899     | 673,655       | 158,655 | 403,737
Avg. Time/Occ. [s] | 1.71    | 3.72    | 6.46      | 2.72        | 4.39          | 2.74          | 2.16    | 7.10
Proportion [%]     | 10.77   | 11.00   | 13.11     | 8.34        | 43.12         | 7.45          | 1.75    | 4.46
[M] No. of attr. representations | 28 | 7 | 3 | 45 | 72 | 47 | 1 | 1
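The class proportions in Table 7 can be reproduced from the raw sample counts, which also makes the imbalance explicit:

```python
# Reproduce the class proportions of Table 7 from the sample counts.
samples = {
    "Standing": 974_611, "Walking": 994_880, "Cart": 1_185_788,
    "Handling (upwards)": 754_807, "Handling (centred)": 3_901_899,
    "Handling (downwards)": 673_655, "Synchronization": 158_655, "None": 403_737,
}
total = sum(samples.values())
proportions = {c: round(100 * n / total, 2) for c, n in samples.items()}
print(proportions["Handling (centred)"])   # the most frequent class by far
```

Such per-class frequencies are also what one would feed into class weighting or resampling schemes to counter the imbalance.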

3.8. Folder Overview of the LARa Dataset

LARa contains data of an OMoCap system, one IMU-set, and one RGB camera as well as the recording protocol, the tool for annotation and revision and the networks of activity classes and attributes. Table 8 illustrates an overview of the sizes of the folders and the formats of the files.
Table 8

Folder overview of the LARa  dataset.

Folder                       | Folder Size [MiB] | File Format | Recording Rate
OMoCap data                  | 33,774    | csv | 200 fps
IMU data - MbientLab         | 1355.77   | csv | 100 Hz
RGB videos                   | 17,974.82 | mp4 | 30 fps
recording protocol           | 2.58      | pdf | -
annotation and revision tool | 2899.99   | py  | -
class_network                | 1449.55   | pt  | -
attrib_network               | 1449.55   | pt  | -
The files of the OMoCap data, IMU data, and RGB videos are named after the logistics scenario, the subject, and the recording; a file name thus encodes, for example, logistics scenario 01, subject 02, recording 12.

4. Deploying LARa for HAR

The tCNN proposed in [18] was deployed for solving HAR using the LARa dataset; minor changes to the architecture are proposed here. Our tCNN contains four convolutional layers, no downsampling operations, and three fully-connected layers. Downsampling operations are not deployed, as they affect the performance of the network negatively, following the conclusions of [16]. The convolutional layers are composed of 64 filters each, which perform convolutions along the time axis. The first and second fully-connected layers contain 128 units. Considering the definitions in Section 3.6, there are two alternative final fully-connected layers, depending on the task. A softmax layer is used for the direct classification of the activity classes, with one unit per class. A fully-connected layer with a sigmoid activation function is used for computing attributes; this layer contains 19 units. The number of output units corresponds to either the number of classes or the number of attributes, respectively, see Section 3.6. Figure 12 shows the tCNN's architecture.
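The structure described above can be sketched in PyTorch. Only the overall layout — four temporal convolutions with 64 filters and no downsampling, two fully-connected layers of 128 units, and a task-dependent output head — follows the text; the filter length of 5 along the time axis is an assumption for illustration, since the exact value is not restated here:

```python
import torch
import torch.nn as nn

class TCNN(nn.Module):
    """Sketch of the tCNN: four temporal conv layers (64 filters, no downsampling),
    two FC layers of 128 units, and either a class or an attribute output head.
    The filter length of 5 along the time axis is an assumption."""
    def __init__(self, T, D, n_out, attributes=False, filter_len=5):
        super().__init__()
        self.conv = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(1 if i == 0 else 64, 64, (filter_len, 1)),
                            nn.ReLU()) for i in range(4)]
        )
        t_out = T - 4 * (filter_len - 1)            # no padding, stride 1, no pooling
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * t_out * D, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_out),
        )
        # sigmoid head for attributes; for classes, raw logits go into a
        # cross-entropy loss (which applies the softmax internally).
        self.head = nn.Sigmoid() if attributes else nn.Identity()

    def forward(self, x):                            # x: (batch, 1, T, D)
        return self.head(self.fc(self.conv(x)))

# Attribute head: 19 sigmoid outputs for a window of T=50 frames, D=12 channels.
model = TCNN(T=50, D=12, n_out=19, attributes=True)
out = model(torch.randn(8, 1, 50, 12))
```

The small T and D in the demo keep the flattened fully-connected layer manageable; for the full OMoCap input, D would be 126.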
Figure 12

The Temporal Convolutional Neural Network (tCNN) architecture contains four convolutional layers. According to the classification task, there are two types of final fully-connected layer: a softmax and a sigmoid layer.

The architecture processes sequence segments that consist of a feature-map input of size T × D, with T the sequence length and D the number of sequence channels. The sequence segments are extracted following a sliding-window approach with a fixed window and step size, i.e., with overlapping windows. The number of sequence channels D is 126, as there are measurements of position and rotation in R^3 for each of the 21 joints of the LARa OMoCap dataset (21 × (3 + 3) = 126). This excludes the joint "lower_back", as it is used for normalizing the human poses with respect to the subject. The tCNN computes either an activity class or a binary attribute representation from an input sequence. Predicting attribute representations follows the method in [16]. Differently from a standard tCNN, this architecture contains a sigmoid activation function replacing the softmax layer. The sigmoid activation function is computed as sigmoid(x) = 1/(1 + e^(−x)) and is applied to each element of the output layer. The output can be considered as binary pseudo-probabilities for each attribute being present or not in the input sequence. The architecture is trained using the binary cross-entropy loss, given by BCE(a, â) = −(1/K) Σ_{k=1..K} [a_k log â_k + (1 − a_k) log(1 − â_k)], with a the target attribute representation and â the output of the architecture. Following [3,12], input sequences are normalized per sensor channel. Additionally, Gaussian noise is added; this noise simulates the sensors' inaccuracies. Following the training procedures from [1,2], the LARa OMoCap data are divided into three sets — training, validation, and testing — by subject. An early-stopping approach is followed using the validation set; this set is also deployed for finding proper training hyperparameters. Recordings with the label None are not considered for training, following the procedure in [3].
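The sliding-window segmentation and per-channel normalization described above can be sketched as follows. The window size, step size, and the [0, 1] target range are placeholders, since the exact values are not restated here:

```python
import numpy as np

def sliding_windows(seq, window, step):
    """Extract overlapping segments of shape (window, D) from a (T, D) sequence."""
    starts = range(0, seq.shape[0] - window + 1, step)
    return np.stack([seq[s:s + window] for s in starts])

def normalize_per_channel(seq, eps=1e-8):
    """Scale every sensor channel of a (T, D) sequence to [0, 1] (assumed range)."""
    lo, hi = seq.min(axis=0), seq.max(axis=0)
    return (seq - lo) / (hi - lo + eps)

rng = np.random.default_rng(0)
recording = rng.normal(size=(1000, 126))    # one OMoCap recording: T x 126 channels
windows = sliding_windows(normalize_per_channel(recording), window=100, step=12)
```

Each resulting segment is one feature-map input of size T × D for the tCNN.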
The architecture is trained using mini-batch gradient descent with the RMSProp update rule and a batch size of 400. Moreover, dropout was applied to the first and second fully-connected layers. In the case of predicting attributes for solving HAR, a nearest-neighbor (NN) approach was used for computing a class from the predicted attributes: the Euclidean distance is measured from the predicted attribute vector to each attribute representation in A. This is possible as each activity class is related to a certain number of binary attribute vectors in the attribute representation A, see Table 7. LARa provides the attribute representation A. Both the predicted attribute vector and the attribute representations A are normalized using the 2-norm. The tCNN is also trained using a softmax layer predicting activity classes directly; in this case, the architecture is trained using the cross-entropy loss. Table 9 and Table 10 present the performance of the method solving HAR on the LARa OMoCap dataset using the softmax layer and the attribute representation. Precision is computed as TP/(TP + FP) and recall as TP/(TP + FN), with TP, FP, and FN the true positives, false positives, and false negatives. The weighted F1 is calculated as wF1 = Σ_c (n_c/N) · F1_c, with n_c the number of window samples of class c and N the total number of windows. Handling and moving-Cart activities show the best performances. Using the attribute representation boosts the performance in comparison with the softmax classifier; in particular, the approach only classifies the Synchronization and Standing activities correctly when using attribute representations. In general, deploying an attribute representation boosts the performance of HAR. These results coincide with [13,16]: attributes belonging to frequent classes help with the classification of less frequent classes, and the effects of the unbalanced-class problem are reduced.
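The nearest-neighbor mapping from a predicted attribute vector to a class can be sketched with NumPy. The attribute matrix below is a toy stand-in for illustration; the real representation A ships with LARa:

```python
import numpy as np

def attribute_to_class(pred, A, labels):
    """Map a predicted attribute vector to the class of the nearest row of A.
    Both the prediction and the rows of A are L2-normalized, as in the text."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    p_n = pred / np.linalg.norm(pred)
    dists = np.linalg.norm(A_n - p_n, axis=1)     # Euclidean distance per representation
    return labels[int(np.argmin(dists))]

# Toy attribute representation: two representations for "Walking", one for "Standing".
A = np.array([[1, 0, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
labels = ["Walking", "Walking", "Standing"]
pred = np.array([0.9, 0.1, 0.8, 0.2])             # sigmoid pseudo-probabilities
print(attribute_to_class(pred, A, labels))
```

Because several rows of A may map to the same class, frequent classes lend their attribute vectors to the recognition of infrequent ones — the sharing effect noted in the text.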
Table 9

Recall and precision of human activity recognition (HAR) on the LARa OMoCap dataset.

Output     | Metric        | Stand. | Walk. | Cart  | Hand. (up.) | Hand. (cent.) | Hand. (down.) | Sync.
Softmax    | Recall [%]    | 3.11   | 71.96 | 71.34 | 61.39       | 87.40         | 65.30         | 0.0
Softmax    | Precision [%] | 73.00  | 45.29 | 81.35 | 57.10       | 70.85         | 80.72         | 0.0
Attributes | Recall [%]    | 55.86  | 54.31 | 76.12 | 69.16       | 80.99         | 74.36         | 69.84
Attributes | Precision [%] | 24.22  | 60.59 | 92.13 | 79.08       | 82.94         | 74.63         | 89.31
Table 10

The over-all accuracy and weighted F1 of HAR on the LARa OMoCap dataset.

Metric  | Softmax | Attributes
Acc [%] | 68.88   | 75.15
wF1 [%] | 64.43   | 73.62
Table 11 and Table 12 show the confusion matrices of the predictions using the tCNN in combination with the softmax layer and with the NN approach on the attribute representation, respectively. In general, the method exhibits difficulties predicting the class Standing; it mispredicts Standing sequence segments as Handling (centred) ones. The Walking class also shows some mispredictions. Following the results in Table 9 and Table 10, solving HAR using the attribute representation offers a better performance in comparison with the usage of a softmax layer. The classification of activity classes such as Synchronization, Standing, and Walking improves significantly.
Table 11

Confusion matrix from the class predictions using tCNN with the softmax layer.

Activities     | Stand. | Walk. | Cart  | Hand. (up.) | Hand. (cent.) | Hand. (down.) | Sync.
Stand.         | 311    | 1807  | 211   | 134         | 7446          | 86            | 0
Walk.          | 42     | 3776  | 461   | 35          | 918           | 15            | 0
Cart           | 1      | 928   | 9005  | 4           | 2684          | 1             | 0
Hand. (up.)    | 0      | 92    | 152   | 4188        | 2389          | 1             | 0
Hand. (cent.)  | 72     | 1681  | 1233  | 1717        | 36,587        | 572           | 0
Hand. (down.)  | 0      | 51    | 2     | 12          | 1437          | 2826          | 0
Sync.          | 0      | 2     | 6     | 1245        | 178           | 0             | 0
Table 12

Confusion matrix from the class predictions using the attribute predictions with tCNN and the nearest neighbor (NN) approach.

Activities     | Stand. | Walk. | Cart   | Hand. (up.) | Hand. (cent.) | Hand. (down.) | Sync.
Stand.         | 2421   | 1492  | 506    | 268         | 4668          | 219           | 421
Walk.          | 633    | 3179  | 691    | 47          | 649           | 41            | 7
Cart           | 44     | 298   | 11,630 | 20          | 629           | 2             | 0
Hand. (up.)    | 71     | 44    | 32     | 5395        | 1218          | 20            | 42
Hand. (cent.)  | 1085   | 825   | 2403   | 1919        | 34,719        | 832           | 79
Hand. (down.)  | 73     | 15    | 16     | 9           | 982           | 3230          | 3
Sync.          | 7      | 0     | 0      | 143         | 3             | 0             | 1278
Table 13 presents the performance on the attributes. In general, the attributes are correctly classified. The attributes none and error are not present in the test dataset; however, they are also not misclassified. The attribute Torso Rotation is likewise not mispredicted; nevertheless, the precision and recall of this attribute are zero, which suggests that it is never predicted when it should be. An improvement in this particular attribute is needed.
Table 13

The accuracy, precision, and recall [%] for the attributes on the test dataset.

| Attribute | Accuracy | Precision | Recall |
|-----------|----------|-----------|--------|
| a1        | 89.3     | 79.0      | 82.1   |
| a2        | 76.9     | 82.8      | 70.3   |
| a3        | 84.5     | 83.4      | 92.0   |
| a4        | 93.9     | 80.4      | 73.1   |
| a5        | 81.7     | 85.6      | 83.9   |
| a6        | 96.4     | 76.7      | 68.7   |
| a7        | 82.5     | 86.3      | 72.0   |
| a8        | 96.9     | 0.0       | 0.0    |
| a9        | 92.0     | 92.8      | 98.5   |
| a10       | 79.1     | 81.6      | 92.4   |
| a11       | 90.3     | 91.9      | 36.2   |
| a12       | 76.2     | 48.7      | 37.2   |
| a13       | 71.7     | 60.4      | 63.0   |
| a14       | 85.2     | 74.3      | 26.8   |
| a15       | 91.3     | 88.8      | 74.2   |
| a16       | 98.3     | 98.8      | 49.5   |
| a17       | 90.2     | 95.4      | 41.0   |
| a18       | 100      | 0.0       | 0.0    |
| a19       | 100      | 0.0       | 0.0    |
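The per-attribute metrics in Table 13 are computed independently for each binary attribute. The sketch below uses hypothetical windows with three attributes and reproduces the Torso Rotation effect: accuracy can stay high while precision and recall drop to zero when an attribute that does occur is never predicted:

```python
def attribute_metrics(y_true, y_pred):
    # y_true, y_pred: lists of equal-length binary attribute vectors.
    n_attrs = len(y_true[0])
    metrics = []
    for j in range(n_attrs):
        t = [v[j] for v in y_true]
        p = [v[j] for v in y_pred]
        tp = sum(a == 1 and b == 1 for a, b in zip(t, p))
        fp = sum(a == 0 and b == 1 for a, b in zip(t, p))
        fn = sum(a == 1 and b == 0 for a, b in zip(t, p))
        acc = sum(a == b for a, b in zip(t, p)) / len(t)
        prec = tp / (tp + fp) if tp + fp else 0.0  # 0.0 if never predicted
        rec = tp / (tp + fn) if tp + fn else 0.0   # 0.0 if never detected
        metrics.append((acc, prec, rec))
    return metrics

# Hypothetical windows; attribute 3 occurs once but is never predicted,
# so its accuracy is 0.75 while precision and recall are both 0.0.
y_true = [(1, 1, 0), (0, 1, 1), (1, 0, 0), (0, 1, 0)]
y_pred = [(1, 1, 0), (0, 1, 0), (1, 1, 0), (0, 1, 0)]
for acc, prec, rec in attribute_metrics(y_true, y_pred):
    print(f"acc={acc:.2f} prec={prec:.2f} rec={rec:.2f}")
```

This is why the text recommends inspecting precision and recall, not only accuracy, when judging rare attributes.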
Both trained tCNNs (with a softmax and a sigmoid layer, respectively) and the attribute representation A are included in the annotation and revision tool. The implementation code of the annotation tool is also available in [39]. These results provide a first evaluation of the dataset for solving HAR.

5. Discussion and Conclusions

This contribution presents LARa, the first freely accessible dataset for the sensor-based recognition of human activities in logistics using semantic attributes. Guidelines for creating a dataset were developed based on an analysis of related datasets. Following these guidelines, 758 min of picking and packing activities of 14 subjects were recorded, annotated, and revised. The dataset contains OMoCap data, IMU data, RGB videos, the recording protocol, and the tool for annotation and revision. Multichannel time-series HAR was solved for LARa using temporal convolutional neural networks (tCNNs). Classification performance is consistent with the state of the art for tCNNs. Semantic descriptions, or attributes, of human activities improve classification performance; this supports the effort of annotating attributes and the conclusions from [16].

From an application perspective, several directions for fundamental research as well as industrial application arise from the LARa dataset. The laboratory dataset LARa will be deployed on IMU data recorded in an industrial environment. The addition of more subjects and the inclusion of further logistical processes and objects is conceivable, and new attributes may be added. Another approach to recognizing human activity is to exploit context: context may provide information about locations and articles and broaden the application spectrum of the dataset, and context information about the process is provided in this contribution. Dependencies between the activities have to be examined, e.g., using state machines. Can information about dependencies increase the accuracy of the recognition of human activities in logistics? Finally, the industrial applicability must be proven through a comparison between sensor-based HAR and manual time-management methods, such as REFA and MTM. Can manual time-management methods be enhanced using HAR and LARa?
Further experiments concerning the relation between the activity classes and the attributes will be relevant to evaluate. Analyzing the architecture's filters and their activations with respect to the attribute representations will be useful for understanding how deep architectures process input signals. The LARa dataset can also be used for solving retrieval problems in HAR; retrieval tasks might help facilitate data annotation. Additionally, data-stream approaches are worth addressing with this dataset, as is a comparison of HAR methods based on statistical pattern recognition and on deep architectures. The extensive LARa dataset will be of use for investigating RNNs. Computational causal behavior models are of interest for incorporating the flow charts of the scenarios and longer temporal relations of the input signals.
Appendix A

Table A1

Content criteria for the filtering process.

| Stage | Content Criteria  | Description |
|-------|-------------------|-------------|
| I     | Human             | Data must relate to human movements. |
| II    | Sensor            | Dataset must contain IMU or OMoCap data, or both. |
| III   | Access            | Dataset must be accessible online, downloadable, and free of charge. |
| IV    | Physical Activity | Caspersen et al. [44] defined physical activity "as any bodily movement produced by skeletal muscles that results in energy expenditure". The definition of physical activity is limited here to torso and limb movement [10]. |
Table A2

Examined datasets per stage.

| Stage | Content Criteria  | No. of Datasets |
|-------|-------------------|-----------------|
| I     | Human             | 173             |
| II    | Sensor            | 95              |
| III   | Access            | 70              |
| IV    | Physical Activity | 61              |
Table A3

Categorization scheme.

| Root Category / Subcategory | Description |
|-----------------------------|-------------|
| General Information         | |
| – Year                      | Year of publication. Updates are not taken into account. |
| – Dataset Name              | Name of the dataset and its acronym |
| – Ref. [Dataset]            | Link and, if available, the identifier (DOI) of the dataset |
| – Ref. [Paper]              | Identifier or, if not available, link of the paper that describes the dataset, uses it, or is generally given as a reference |
| Domain of the Act. Class    | |
| – Work                      | Office work, general work, and physical work in production and logistics |
| – Exercises                 | Sport activity classes, e.g., basketball, yoga, boxing, golf, ice hockey, soccer |
| – Locomotion                | e.g., walking, running, elevating, sitting down, going upstairs and downstairs |
| – ADL                       | Activity classes of daily living, e.g., watching TV, shopping, cooking, eating, cleaning, dressing, driving a car, personal grooming, interacting, talking, lying |
| – Fall Detection            | Falling in different directions and from different heights |
| – Hand Gestures             | Focus on the movement of hands, e.g., arm swiping, hand waving, and clapping |
| – Dance                     | e.g., jazz dance, hip-hop dance, salsa, tango |
| Data Specification          | |
| – Recording Time [min]      | Total time of the recordings in minutes |
| – Data Size [MiB]           | Size of the entire unzipped dataset in mebibytes, including, e.g., RGB videos and pictures |
| – Format                    | Formats of the data published in the repository |
| – No. Subjects              | Number of unique subjects |
| – No. Act. Classes          | Number of individual activity classes |
| – List Act. Classes         | List of all individual activity classes |
| – Laboratory                | The recordings were made in a laboratory environment |
| – Real Life                 | The recordings were made in a real environment, e.g., outdoors, on a sports field, or in a production facility |
| Sensor                      | |
| – OMoCap [Hz]               | Optical marker-based motion capture, with frames per second or hertz as the unit |
| – IMU [Hz]                  | Inertial measurement unit, with hertz as the unit |
| – Other Sensors             | Sensors other than IMU and OMoCap |
| – Phone, Watch, Glasses     | Use of sensors built into a smartphone, smartwatch, or smart glasses |
Table A4

Overview of related publicly available human activity recognition datasets. The entries are sorted chronologically in ascending order by year of publication and alphabetically by dataset name. Missing information is marked with “-”.

General InformationDomain of the Act. classData SpecificationSensorAttachment (Sensor/Marker)
YearDataset NameRef. [Dataset]Ref. [Paper]WorkExercisesLocomotionADLFall DetectionHand GesturesDanceRecording Time [min]Data Size [MiB]FormatNo SubjectsNo Act. ClassesLaboratoryReal LifeOMoCap [fps/Hz]IMU [Hz]Other SensorsPhone, Watch, GlassesHand/WristLower ArmUpper ArmFoot/AnkleLower LegUpper LegHipShoulderBelly/WaistThorax/ChestLower BackUpper BackHead
2003Carnegie Mellon University Motion Capture Database (CMU Mocap)[84]- xxx x-18,673amc11223x 120 xxxxxxxxxxxxx
2004Leuven Action Database[45][85] xxx -14text, avi, xls, pdf122x 30 RGB xxxx xxx x xx
2007HDM05[54][86] xxx x-3000.32c3d, amc, avi570x 120 RGB xxxxxxxx xxxx
2008Wearable Action Recognition Database (WARD)[87][88] xx -41.66mat2013 x 20 xx xxx
2009BodyAttack Fitness[83][89] x 155.64mat16x 64 xx
2009Carnegie Mellon University Multimodal Activity (CMU-MMAC) Database[59][90] x -60,897.03amc, txt, asf, wav, xls, avi4329x 120125RGB, microphone, RFID, BodyMediaxxxxxxxxx xxxx
2009HCI gestures[83][89] x -12.9mat15x 96 xx
2009HumanEva I[91][92] xx -13,824-46x 120RGB, depth xxxxxxxx x xx
2009HumanEva II[93][92] x -4,649-24x 120RGB xxxxxxxx x xx
2010KIT Whole-Body Human Motion Database[48][65] xxx xx-2,097,152xml, c3d, avi22443x 100 RGB xxxxxxxx xxxx
2010Localization Data for Person Activity Data Set[55][94] xxx -20.5txt511x 10 x x x
20113DLife/Huawei ACM MM Grand Challenge 2011[95][96] x--svl, cvs155x 160RGB, microphone, depth x x x
2011UCF-iPhone Data Set[97][98] xx -13.1csv99 x 60 x x
2011Vicon Physical Action Data Set[46][99] xx 33.33144txt1020x 200 xx xx x
2012Activity Prediction (WISDM)[100][101] x -49.1txt296x 20 x x
2012Human Activity Recognition Using Smartphones Data Set (UCI HAR)[102][103] xx 192269txt306x 50 x x
2012OPPORTUNITY Activity Recognition Data Set[4][104] xx x 1500859txt1224 x 32 xxxxxxx xx
2012PAMAP2 Physical Activity Monitoring Data Set[5][105] xxx 6001652.47txt918xx 100heart rate monitor x x x
2012USC-SIPI Human Activity Dataset[76][106] x -42.7mat1412 x 100 x
2013Actitracker (WISDM)[107][108] xx -2588.92txt296 x 20 x x
2013Daily and Sports Activities Data Set[109][110] xxx 760402csv819 x 25 x x x
2013Daphnet Freezing of Gait Data Set[69][111] x 50086.2txt103x 64RGB xx x
2013Hand Gesture[74][1] x x 7047.6mat211 x 32 xxx
2013Physical Activity Recognition Dataset Using Smartphone Sensors[47][112] x -63.1xlsx46 x 50 x x xx x
2013Teruel-Fall (tFall)[113][114] x -65.5dat108 x 50 x x
2013Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements (PUC-Rio)[56][115] x 48013.6dat45x 10 x xx x
2014Activity Recognition from Single Chest-Mounted Accelerometer Data Set[116][117] xx 43144.2csv157 x 52 x
2014Realistic sensor displacement benchmark dataset (REALDISP)[118][119] xx 566.026717.43txt1733x 50 xx xx x
2014Sensors activity dataset[47][120] x 2800308csv108 x 50 x xx xx
2014User Identification From Walking Activity Data Set[121][117] xx 4314.18csv225 x 52RGB, microphonex x
2015Complex Human Activities Dataset[47][122] xx 390240csv1013 x 50 xx x
2015Heterogeneity Activity Recognition Data Set (HHAR)[60][123] x 2703333.73csv96 x 200 x x x
2015Human Activity Recognition with Inertial Sensors[57][124] xxx 496324mat1913x 10 x xx
2015HuMoD Database[125][126] xx 49.46044.27mat28x 500 EMG xxxxx xxxx
2015Project Gravity[61][127] xxx -27.6json319x 25RGBxx x
2015Skoda Mini Checkpoint[83][128]x x 18080.3mat110 x 98 xxx
2015Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set (SBHAR)[129][130] xx 300240txt3012 x 50RGBx x
2015UTD Multimodal Human Action Dataset (UTD-MHAD)[49][131] xx x -1316.15mat, avi827x 50depth x x
2016Activity Recognition system based on Multisensor data fusion (AReM) Data Set[132][133] xx 1761.69csv16x 70 x x x x
2016Daily Log[51][134] xxx 106,5604815.97csv733 x xGPSx x x
2016ExtraSensory Dataset[52][135]xxxx 308,320144,423.88dat, csv, mfcc6051 x 40microphonexx x
2016HDM12 Dance[136][136] x972,175.48asf, c3d2220x 128 xxxxxxxx x xx
2016RealWorld[75][67] xx 10653891.92csv158 x 50GPS, magnetic field, microphone, RGB, lightx xx xx xx x
2016Smartphone Dataset for Human Activity Recognition in Ambient Assisted Living[137][103] xx 94.7946.5txt306 x 50 x x
2016UMAFall: Fall Detection Dataset[138][139] xxx -359csv1914 x 200 xx x x xx
2017An Open Dataset for Human Activity Analysis using Smart Devices[62][140] xxx -433csv116 x x xx x x
2017IMU Dataset for Motion and Device Mode Classification[141][50] x -2835.21mat83 x 100 xxxxxxxxx xx
2017Martial Arts, Dancing and Sports (MADS) Dataset[71][142] x x-24,234.96mov, zip55x 60 RGB xxxxxxxx xxxx
2017Physical Rehabilitation Movements Data Set (UI-PRMD)[66][143] xx -4700.17txt1010x 100 depth xxxxxxxx x xx
2017SisFall[144][145] xxx 1849.331627.67txt3834 x 200 x
2017TotalCapture Dataset[72][146] xx ---55x x60RGB xxxxxxxx xxxx
2017UniMiB SHAR[147][148] xxx -255mat3017x 50microphonex x
2018Fall-UP Dataset (Human Activity Recognition)[149][150] xxx 165.0078csv1711x 100infrared, RGB x x xxx xx
2018First-Person View[63]- xx -1046.86mp4, csv27 x xRGBx x x x
2018HAD-AW[64][151]xxxx x-325xlsx1631 x 50 xx
2018HuGaDB[152][153] x 600401txt1812 x xEMG xxx
2018Oxford Inertial Odometry Dataset (OxIOD)[53][154] x 883.22751.73csv42xx250100 xx x
2018Simulated Falls and Daily Living Activities Data Set[155][156] xxx 6303972.06txt1736x 25 x xx xx x
2018UMONS-TAICHI[157][158] x -28,242.47txt, c3d, tsv1213x 179 RGB, depth xxxxxxxx x xx
2019AndyData-lab-onePerson[70][32]x xx 30099,803.46mvn, mvnx, c3d, bvh, csv, qtm, mp4136x 120240RGB, pressure sensor handglove xxxxxxxx xxxx
2019PPG-DaLiA[58][159]x xx 2,19023,016.74pkl, csv158 x 700PPG, ECG x x
Sum 61 5 20 51 35 9 6 7 33 30 25 29 30 24 20 28 36 31 16 10 25 11 18 21
Min. 15 1.69 1 2 30 10
Avg. 13,531.08 43,605.15 21.1 14.8 155.9 86.2
Max. 308,320 2,097,152 224 70 500 700
2019 LogisticActivityRecognitionChallenge (LARa) [160] x x 758 58,907.15 csv, mp4, pdf, pt, py 14 8 x 200 100 RGB x x x x x x x x x x x x x
References

1.  Perception of biological motion: a stimulus set of human point-light actions.

Authors:  Jan Vanrie; Karl Verfaillie
Journal:  Behav Res Methods Instrum Comput       Date:  2004-11

2.  Physical activity, exercise, and physical fitness: definitions and distinctions for health-related research.

Authors:  C J Caspersen; K E Powell; G M Christenson
Journal:  Public Health Rep       Date:  1985 Mar-Apr       Impact factor: 2.792

3.  SisFall: A Fall and Movement Dataset.

Authors:  Angela Sucerquia; José David López; Jesús Francisco Vargas-Bonilla
Journal:  Sensors (Basel)       Date:  2017-01-20       Impact factor: 3.576

4.  UP-Fall Detection Dataset: A Multimodal Approach.

Authors:  Lourdes Martínez-Villaseñor; Hiram Ponce; Jorge Brieva; Ernesto Moya-Albor; José Núñez-Martínez; Carlos Peñafort-Asturiano
Journal:  Sensors (Basel)       Date:  2019-04-28       Impact factor: 3.576

5.  Deep PPG: Large-Scale Heart Rate Estimation with Convolutional Neural Networks.

Authors:  Attila Reiss; Ina Indlekofer; Philip Schmidt; Kristof Van Laerhoven
Journal:  Sensors (Basel)       Date:  2019-07-12       Impact factor: 3.576

6.  Fusion of smartphone motion sensors for physical activity recognition.

Authors:  Muhammad Shoaib; Stephan Bosch; Ozlem Durmaz Incel; Hans Scholten; Paul J M Havinga
Journal:  Sensors (Basel)       Date:  2014-06-10       Impact factor: 3.576

7.  Detecting falls with wearable sensors using machine learning techniques.

Authors:  Ahmet Turan Özdemir; Billur Barshan
Journal:  Sensors (Basel)       Date:  2014-06-18       Impact factor: 3.576

8.  Complex Human Activity Recognition Using Smartphone and Wrist-Worn Motion Sensors.

Authors:  Muhammad Shoaib; Stephan Bosch; Ozlem Durmaz Incel; Hans Scholten; Paul J M Havinga
Journal:  Sensors (Basel)       Date:  2016-03-24       Impact factor: 3.576

9.  Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition.

Authors:  Francisco Javier Ordóñez; Daniel Roggen
Journal:  Sensors (Basel)       Date:  2016-01-18       Impact factor: 3.576

10.  UMONS-TAICHI: A multimodal motion capture dataset of expertise in Taijiquan gestures.

Authors:  Mickaël Tits; Sohaïb Laraba; Eric Caulier; Joëlle Tilmanne; Thierry Dutoit
Journal:  Data Brief       Date:  2018-05-23