Literature DB >> 32707928

LARa: Creating a Dataset for Human Activity Recognition in Logistics Using Semantic Attributes.

Friedrich Niemann1, Christopher Reining1, Fernando Moya Rueda2, Nilah Ravi Nair1, Janine Anika Steffens1, Gernot A Fink2, Michael Ten Hompel1.   

Abstract

Optimizations in logistics require the recognition and analysis of human activities. The potential of sensor-based human activity recognition (HAR) in logistics is not yet well explored. Despite a significant increase in HAR datasets in the past twenty years, no available dataset depicts activities in logistics. This contribution presents the first freely accessible logistics dataset. In the 'Innovationlab Hybrid Services in Logistics' at TU Dortmund University, two picking scenarios and one packing scenario were recreated. Fourteen subjects were recorded individually while performing warehousing activities using Optical marker-based Motion Capture (OMoCap), inertial measurement units (IMUs), and an RGB camera. A total of 758 min of recordings were labeled by 12 annotators in 474 person-h. All data have been labeled and categorized into 8 activity classes and 19 binary coarse-semantic descriptions, also called attributes. The dataset is deployed for solving HAR using deep networks.

Keywords:  attribute-based representation; dataset; human activity recognition; inertial measurement unit; logistics; motion capturing

Year:  2020        PMID: 32707928      PMCID: PMC7436169          DOI: 10.3390/s20154083

Source DB:  PubMed          Journal:  Sensors (Basel)        ISSN: 1424-8220            Impact factor:   3.576


1. Introduction

Human activity recognition (HAR) assigns human action labels to signals of movements. Signals are time series that are obtained from video frames, marker-based motion capturing systems (MoCap), or inertial measurements. This work focuses on HAR using MoCap and inertial measurements. Methods of HAR are critical for many applications, e.g., medical and rehabilitation support, smart homes, sports, and industry [1,2,3]. Nevertheless, HAR is a complicated task due to the large intra- and inter-class variability of human actions [1]. In addition, extensively annotated data for HAR are scarce. This is mainly due to the complexity of the annotation process. Moreover, datasets for HAR are likely to be unbalanced: usually, there exist more samples of frequent activities, e.g., walking or standing, than of rare ones, e.g., picking an article [4,5].

Warehousing is an essential element of every supply chain. Its main purpose is storing articles and satisfying customers' orders in a cost- and time-efficient manner. Despite increasing automation and digitization in warehousing, and contrary to the impression of a shrinking workforce, employee numbers are rising [6,7]. Manual order-picking and -packing are labor-intensive and costly processes in logistics. Information on the occurrence, duration, and properties of relevant human activities is fundamental for assessing how to enhance employee performance in logistics. In the state of the art, manual activities of employees are mostly analyzed manually or analytically, using methods such as the REFA Time Study [8] or Methods-Time Measurement (MTM) [9]. The potential of sensor-based HAR in logistics is not yet well explored. According to Reining et al. [10], only three publications dealt with HAR in logistics in the past ten years [11,12,13]. One major reason for this is the lack of freely accessible and usable datasets that contain industrial work processes.
This is because industrial environments such as factories and warehouses pose a challenge for data recording. Regulations such as the European General Data Protection Regulation [14] create further barriers when handling sensitive data, such as the work performance of employees and their physical characteristics. Thus, scientists tend to fall back on pseudo-industrial laboratory set-ups for dataset creation. The closeness to reality of these small-scale laboratory set-ups is often questionable. For example, the recognition performance in [10] suffered from a recording procedure that split a workflow into activities that were each recorded individually, because the transitions between activities were not captured properly. In the 'Innovationlab Hybrid Services in Logistics' at TU Dortmund University, manual processes of a real-world warehouse are replicated on an area of 220 m² [15]. Fourteen individuals carry out picking and packaging activities in three scenarios under real-life conditions. All activities are recorded in a sensor-rich environment, using an Optical marker-based Motion Capture (OMoCap) system, three sets of inertial measurement units (IMUs), and an RGB camera. All data streams are synchronized. In total, LARa contains 758 min of data. Twelve annotators labeled the OMoCap data in 474 person-h (PHR). A subsequent revision took 143 PHR by 4 revisers. Data are labeled using 8 activity classes and 19 binary coarse-semantic descriptions. These descriptions will be denoted as attributes [16]. Traditional methods of statistical pattern recognition have been used for HAR. These methods segment signal sequences using a sliding-window approach, extract relevant features from the segmented sequences, and train a classifier that assigns action labels. Recently, deep architectures have been used successfully for solving HAR problems. They are end-to-end architectures, composed of feature extractors and a classifier.
They combine learnable convolutional operations with non-linear functions, downsampling operations, and a classification layer [2,3,17,18]. These architectures map sequence segments of measurements from multichannel sensors into a single class or a semantic-based representation [16]. Stacked convolution and downsampling operations extract abstract and complex temporal relations from these input sequences. Attribute-based representations have been deployed for solving HAR. Attributes describe activity classes semantically [16]. For example, handling can be represented by moving the left, right, or both hands, and by the pose resulting from the picked article. Right hand, left hand, and box can be considered attributes. Attributes are used for sharing high-level concepts among activity classes. They are an additional mapping between sequence measurements of the data streams and activity classes. In [13,16], a single combination of attributes represents an activity class. Nevertheless, this limits the properties of attribute representations. As human actions vary, they could be represented by different combinations of attributes. This paper introduces a novel and large dataset for HAR in the context of logistics. The dataset contains annotations of human actions and their attributes in intra-logistics. This paper explains in detail the recording scenarios, sensor settings, and the annotation process. In addition, it presents the performance of deep architectures for solving HAR on the provided dataset, and it describes an approach for adapting deep architectures to solve HAR using attribute representations. For the dataset, the detailed annotation of these attributes leads to a total of 204 unique attribute representations for the 8 activity classes. This high level of granularity is a prerequisite for evaluating different activity recognition approaches.
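To make the attribute idea concrete, the following sketch encodes activity classes as binary attribute vectors and maps a predicted vector back to the class with the nearest reference representation via Hamming distance. The attribute names and vectors are invented for illustration; the actual dataset uses 19 attributes.

```python
import numpy as np

# Hypothetical one-to-one mapping from activity classes to binary
# attribute vectors (names and vectors are invented, not LARa's).
ATTRIBUTES = ["gait_cycle", "upwards", "handy_unit", "torso_rotation"]
CLASS_TO_ATTRS = {
    "walking":  np.array([1, 0, 0, 0]),
    "handling": np.array([0, 1, 1, 1]),
    "standing": np.array([0, 0, 0, 0]),
}

def decode(prediction, class_to_attrs):
    """Return the class whose reference attribute vector has the
    smallest Hamming distance to the predicted binary vector."""
    best_class, best_dist = None, None
    for cls, ref in class_to_attrs.items():
        dist = int(np.sum(prediction != ref))
        if best_dist is None or dist < best_dist:
            best_class, best_dist = cls, dist
    return best_class
```

Because attributes are shared among classes, a network that predicts them can generalize concepts such as "hand use" across activities instead of learning each class in isolation.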
The LARa dataset contains labeled IMU and OMoCap data, the respective RGB videos, the recording protocol, as well as the annotation and revision tool. All data are freely accessible: https://doi.org/10.5281/zenodo.3862782. The contribution also answers the following research questions in the context of the first freely accessible logistics HAR dataset, the Logistic Activity Recognition Challenge (LARa): What is the state of the art of dataset creation for multichannel time-series HAR? What guidelines are proposed for creating a novel dataset for HAR? What are the properties of a logistics dataset for HAR created by following these guidelines? How does a tCNN perform on this dataset using a softmax compared to an attribute representation? This contribution is organized as follows. Section 2 presents the related work on multichannel time-series HAR. In Section 3, the freely accessible dataset LARa is introduced. First, the data recording steps in the logistics scenarios are presented. Second, the activity classes and semantic attributes are explained. Third, findings of the annotation and revision process are highlighted. Section 3 concludes with an overview of the LARa dataset. Section 4 presents an example of solving HAR on the LARa dataset using deep architectures. Finally, Section 5 offers a discussion and the conclusions of this contribution. Additionally, Appendix A gives an overview of state-of-the-art datasets for HAR. Based on the datasets' descriptions, the guideline for creating the novel dataset in Section 3 is derived.

2. Related Work

Methods of supervised statistical pattern recognition have been used successfully for HAR. The standard pipeline consists of preprocessing, segmentation, statistical feature extraction, and classification. High- and low-pass filters are common preprocessing steps. Low-pass filters serve denoising, as faulty sensor measurements lie in the high-frequency spectrum, whereas changes in human motion are rather low-frequency. Low-pass operations are also used for separating gravitation and inclination of the IMUs in a constant frame of reference, i.e., the earth [19]. A segmentation approach, e.g., a sliding window, divides the input signal into segments of a certain time duration. Statistical features are computed from the time and frequency domain, for example, the mean, variance, channel correlation, entropy, energy, and coherence [1,11,20,21]. The authors in [10] present a summary of such features. Using these features, the parameters of a classifier are computed. The classifier assigns an activity label to an unknown input. Some examples of classifiers are Naïve Bayes, Support Vector Machines (SVMs), Random Forests, Dynamic Time Warping (DTW), and Hidden Markov Models (HMMs) [11,22]. These methods, however, might show low performance on challenging HAR problems. In addition, different combinations of features must be selected manually per activity. This makes the method hardly scalable and prone to overfitting [3,19]. The authors in [11] evaluate HAR for order picking using statistical pattern recognition. They present a novel dataset of human order-picking activities, using a low number of sensor devices. Specifically, they deployed three inertial measurement units (IMUs), worn by workers in two different scenarios. They computed handcrafted statistical features on segments extracted with the sliding-window approach. The authors evaluated three classifiers, namely an SVM, a Naïve Bayes, and a Random Forest.
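The classical pipeline described above can be sketched as follows; this is a minimal NumPy illustration in which a nearest-centroid classifier stands in for an SVM or Random Forest, and the window size, stride, and feature set are illustrative choices only.

```python
import numpy as np

def sliding_windows(signal, win, stride):
    """Segment a [T, C] multichannel signal into overlapping windows."""
    return np.stack([signal[s:s + win]
                     for s in range(0, len(signal) - win + 1, stride)])

def statistical_features(window):
    """Per-channel mean, variance, and signal energy of one [win, C] segment."""
    return np.concatenate([window.mean(axis=0),
                           window.var(axis=0),
                           (window ** 2).mean(axis=0)])

def fit_centroids(features, labels):
    """Nearest-centroid classifier: one mean feature vector per class."""
    return {y: features[labels == y].mean(axis=0) for y in np.unique(labels)}

def predict(centroids, feat):
    """Assign the class whose centroid is closest in feature space."""
    return min(centroids, key=lambda y: np.linalg.norm(feat - centroids[y]))
```

A real system would replace the centroid step with one of the classifiers named above, but the segmentation and feature-extraction stages are structurally the same.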
The authors in [19] solve HAR for activities of daily living. They compute statistical features on three data streams, namely the raw inertial measurements and their AC and DC components. They propose a hierarchical approach that bags the outputs of simple classifiers over different combinations of device locations on the human body. Deep architectures have also been deployed for solving HAR. Temporal Convolutional Neural Networks (tCNNs), Recurrent Neural Networks (RNNs), e.g., Long Short-Term Memory networks (LSTMs), and combinations of both are examples of architectures in the field. tCNNs are hierarchical architectures that combine feature extraction along time and classification in an end-to-end approach. They learn the features and the parameters of the classifier directly from raw data. tCNNs are presented in [17,18,23]. They are composed of convolution and pooling operations that are carried out along the time axis. tCNNs exploit their hierarchical composition, becoming more discriminative with respect to human actions. The combination of stacked convolutional and pooling layers finds temporal relations that are invariant to temporal translation. They are also robust against noise. Moreover, these architectures share small temporal filters among all the sensors in the IMUs, as local temporal neighborhoods are likely to be correlated independently of the sensor type. The authors in [2] introduce an architecture that combines temporal convolutions with LSTM layers replacing the fully connected layers. LSTMs are recurrent units with memory cells and a gating system, which are suitable for learning long temporal dependencies in sequences. These units do not suffer from exploding or vanishing gradients during training. The authors in [24] utilize a shallow recurrent network, namely a three-layered LSTM and a one-layered bidirectional LSTM (BLSTM). BLSTMs process sequences in both the forward and backward directions.
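The core tCNN operations, a temporal convolution followed by a non-linearity and pooling along the time axis, can be sketched in NumPy as below. The filter shapes and the ReLU choice are illustrative; real architectures stack several such layers and end in a classification layer.

```python
import numpy as np

def temporal_conv(x, kernels, bias):
    """Valid 1D convolution along the time axis of a [T, C] segment.
    kernels: [F, K, C], i.e., F filters of temporal length K applied
    across all C sensor channels; returns a ReLU-activated
    [T - K + 1, F] feature map."""
    T, C = x.shape
    F, K, _ = kernels.shape
    out = np.empty((T - K + 1, F))
    for t in range(T - K + 1):
        patch = x[t:t + K]                                  # [K, C]
        out[t] = np.tensordot(kernels, patch,
                              axes=([1, 2], [0, 1])) + bias  # [F]
    return np.maximum(out, 0.0)                              # ReLU

def max_pool(x, size=2):
    """Non-overlapping max pooling along the time axis."""
    T = (len(x) // size) * size
    return x[:T].reshape(-1, size, x.shape[1]).max(axis=1)
```

Pooling halves the temporal resolution, which is what makes the extracted temporal relations invariant to small temporal translations.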
The BLSTMs outperform the convolutional architectures. Nevertheless, tCNNs show more robust behavior against parameter changes. The authors in [3] propose a tCNN adapted for IMUs, called IMU-CNN. The architecture is composed of convolutional branches corresponding to each IMU. These branches compute an intermediate representation per IMU, which are then combined in the last fully connected layers. The authors compared the IMU-CNN with the tCNN and a tCNN-LSTM similar to [2]. The IMU-CNN shows a better performance, as it is more robust against IMU faults and asynchronous data. The authors in [20] investigate the effect of data normalization on a deep architecture's performance. They compare normalization to zero mean and unit standard deviation, batch normalization, and a pressure-mean subtraction. The architecture's performance improves when utilizing normalization techniques. Extending the work of [3], the authors use four sensor-fusion strategies and find that late fusion strategies are beneficial. Additionally, they evaluate the robustness of the architectures with respect to the proportion of the training dataset. The authors in [16] propose using an attribute-based representation for HAR. In object recognition and word-spotting problems, attributes are semantic descriptions of objects or words. They coarsely represent a class. In [12], a search for attributes is presented, as there are no datasets with such annotations. The selected attributes are better suited for solving HAR. For this search, the authors deploy an evolutionary algorithm. First, they assign random binary representations to action classes as the population. Second, they evaluate a population using deep architectures with a sigmoid activation function. The validation performance serves as the evolutionary fitness. The authors deploy non-local mutations on the populations. They conclude that using attribute representations boosts the performance of HAR.
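A highly simplified sketch of such an evolutionary attribute search follows. Here a toy fitness function replaces the expensive training-and-validation of a deep network with sigmoid outputs, and the population size, mutation strength, and number of generations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def evolve_attributes(n_classes, n_attrs, fitness, generations=30, pop=8):
    """Evolutionary search for a class-to-attribute matrix [n_classes, n_attrs].
    `fitness` stands in for the validation performance of a network
    trained on the candidate representation."""
    population = [rng.integers(0, 2, (n_classes, n_attrs)) for _ in range(pop)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = []
        for _ in range(pop):
            # non-local mutation: flip several random bits of the current best
            child = best.copy()
            idx = rng.integers(0, child.size, size=3)
            child.flat[idx] ^= 1
            children.append(child)
        candidate = max(children, key=fitness)
        if fitness(candidate) >= fitness(best):
            best = candidate
    return best

def toy_fitness(rep):
    """Toy surrogate: reward representations with mutually distinct rows."""
    return len({tuple(row) for row in rep})
```

In the cited work, each fitness evaluation requires training a network, which is why the search is expensive and why semantically defined attributes, as in LARa, are preferable when available.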
Even a random attribute representation performs comparably to directly classifying human actions. A drawback of this approach was the lack of a semantic definition of the attributes. Attribute-based representations have been explored in depth for HAR in [13]. Particularly in the manual order-picking process, attribute representations were expected to be beneficial for dealing with the versatility of activities. That contribution compared the performance of deep architectures trained using different attribute representations, and it evaluated their quantitative performance as well as their quality from the perspective of practical application. Expert-given attribute representations performed better than a random one created following the conclusions in [16]. A semantic relation between attributes and activities enhances HAR not only quantitatively with regard to performance, but it also enables a transfer of the attributes between activities by domain experts. In this preliminary work, the mapping between activity classes and attribute representations was one-to-one. This reduces the task to a multiclass problem and limits the benefits of attribute-based representations. An important element of these supervised methods is annotated data [20]. A drawback of using deep methods is the need for extensive annotated data, in contrast to statistical pattern recognition. However, capturing and annotating data for HAR is laborious and expensive. Moreover, annotations of attributes do not yet exist, and these fine-grained annotations represent an extra cost. In [13], human actions were given unique attribute representations. Nevertheless, human actions might include different combinations of attributes. Different combinations of attributes might be helpful for zero-shot learning and for reducing the effects of class imbalance. They also might allow clustering signals of a certain activity that show slight changes in the human movements.
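Decoding a network's sigmoid outputs under a many-to-one attribute table could look as follows. The table entries are invented for illustration; LARa itself has 204 unique attribute representations for its 8 classes.

```python
# Hypothetical many-to-one table: several attribute combinations map to
# the same class, since one activity may be performed in different ways.
REPRESENTATIONS = {
    (1, 0, 1, 0): "handling",   # e.g., right hand, no cart (invented)
    (0, 1, 1, 0): "handling",   # e.g., left hand, no cart (invented)
    (1, 0, 0, 1): "cart",
    (0, 0, 0, 0): "standing",
}

def decode_sigmoid(scores, table, threshold=0.5):
    """Threshold per-attribute sigmoid scores to a binary vector, then
    return the class of the nearest known attribute combination
    (exact match first, otherwise minimum Hamming distance)."""
    binary = tuple(int(s > threshold) for s in scores)
    if binary in table:
        return table[binary]
    nearest = min(table,
                  key=lambda ref: sum(a != b for a, b in zip(binary, ref)))
    return table[nearest]
```

Because several vectors map to one class, variations of the same activity land on the same label, while the binary vectors themselves remain available for clustering or zero-shot transfer.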
So far, there is no large-scale, freely accessible dataset of human activities in complex industrial processes, nor one using attributes. In addition, there are no standard guidelines for creating such a dataset; thus, guidelines need to be defined beforehand. A review of existing datasets and their shortcomings with regard to the goal of this paper is presented in the appendix to further motivate the introduction of the new dataset in the following section.

3. Introducing the LARa Dataset

This section states the LARa dataset’s specifications. Requirements and specifications of LARa are based on a detailed review of datasets for HAR, see Appendix A. In particular, the origin of the laboratory set-ups, the subjects’ characteristics as well as the recording and annotation procedure are showcased. For data recording, the researchers created physical replicas of real-world warehouses in a laboratory. They are called scenarios in this contribution. This subsection gives insights into the replicas’ creation, and it explains the underlying warehousing processes. Next, the sensors’ configuration and the proper preparation of the subjects are presented.

3.1. Guidelines for Creating and Publishing a Dataset

The datasets discussed in Appendix A show no uniform guidelines for dataset creation. Based on this overview of the datasets and their descriptions, a guideline for the creation of a dataset is derived. If possible, the recording should take place under real conditions. Realistic environments, e.g., a real warehouse or a detailed replica, ensure recording natural movements. A replica requires a large laboratory. In addition, objects similar to those in the real scenario are needed, e.g., a picking card. The subjects' selection depends on the variety of people in the real environment, e.g., employees of a real warehouse. The selection criteria involve age, sex, height, and handedness. In addition to the realistic environment, the behavior of the subjects should be as natural as possible. Instead of recording individual activities in isolation, recording a whole process enables natural behavior and thus natural movements. A recording should therefore not only consist of one activity, e.g., lifting a box, but should capture the activity as part of a complete process. Using a recording protocol and an RGB camera for documentation, discrepancies such as the slipping of sensors or markers remain noticeable after the recordings. It is recommended to use different sensor types with a high frame rate. Since there is no uniform positioning of sensors, several sets of different positions on the human body can be experimented with. OMoCap and RGB videos can help in complex annotation scenarios. The annotation is to be carried out by domain experts such as physiotherapists, dance teachers, or, in the case of logistics, logistics experts. As soon as several people annotate or are expected to benefit from the annotated data, an annotation guideline is necessary. A revision of the annotation is recommended to improve the quality of the labeled data. To support other applications, the representation of the activity classes should be as granular as possible.
The granularity depends on the number of activities and can be increased by binary coarse-semantic descriptions. Necessary general information, such as the location and period of the recordings, must be specified. The method of data acquisition and the description of the activities are part of the dataset's description. In addition to the method of annotation and its effort, the labeled activity classes must be described. The dataset should contain labeled and raw data from all sensors. Access to the annotation tool must be provided so that the annotation process can be understood.
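As a sketch, the descriptive metadata demanded by this guideline could be captured in a small record. The field names are our own invention, and the values are those reported for LARa in this paper; the recording period is deliberately left unspecified, as it is not stated in this excerpt.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetCard:
    """Descriptive metadata a HAR dataset publication should specify
    (field names are illustrative, not a standard schema)."""
    name: str
    location: str
    recording_period: str          # must be specified by the publishers
    sensors: list = field(default_factory=list)
    activity_classes: int = 0
    attributes: int = 0
    recording_minutes: int = 0
    annotation_person_hours: int = 0
    annotation_tool_included: bool = False

# values reported for LARa in this contribution
lara = DatasetCard(
    name="LARa",
    location="Innovationlab Hybrid Services in Logistics, "
             "TU Dortmund University",
    recording_period="",  # left empty; not stated in this excerpt
    sensors=["OMoCap", "IMU (MbientLab)", "RGB camera"],
    activity_classes=8,
    attributes=19,
    recording_minutes=758,
    annotation_person_hours=474,
    annotation_tool_included=True,
)
```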

3.2. Laboratory Set-Ups based on Logistics Scenarios

This subsection explains three logistics scenarios for data recording. The warehousing processes’ graphical representation is based on the guidelines defined by the Object Management Group [25]. The graphical and textual descriptions of the scenario guide researchers when applying methods of HAR that take context into consideration. A detailed explanation of the scenarios might be helpful for approaches involving context, preconditions and effects, e.g., Computational Causal Behavior Models (CCBM) [26]. This context may be the constraints of the warehousing process. For example, some activities can only be performed in a specific order or at a specific location and time. Data were recorded in physical set-ups created in a controlled environment—the ’Innovationlab Hybrid Services in Logistics’ at TU Dortmund University [15]. A group of researchers created the physical replica of warehousing scenarios following a cardboard engineering approach [27,28].

3.2.1. Logistics Scenario 1—Simplified Order Picking System

The first scenario is not based on a real warehouse. Nevertheless, such a process may exist in reality. The process is illustrated in Figure 1; the physical laboratory set-up is presented in Figure 2.
Figure 1

Business process model of logistics Scenario 1—simplified order picking.

Figure 2

Physical laboratory set-up of logistics Scenario 1—simplified order picking.

At the beginning of an order-picking process, the subject places boxes on an empty order-picking cart. These empty boxes are provided at the base. In a real warehouse, this base may be a conveyor that transports empty boxes to the order picker while transporting full boxes to the shipping area. In the laboratory, stacking frames recreated the conveyor. This simplification does not influence human-motion behavior. The boxes and the cart were standard items that are common in industry. Next, the subject moves the cart to a retrieval location. The researchers who guided the recordings specified where to go. An order-picking aisle was recreated by placing boxes on frames. When the subject arrives at a retrieval location, they pick articles from a box or they open a fronted bin. The subjects place the articles in an empty box on their cart. The articles were small, light items, such as bags of 500 g. This procedure of taking the cart to a new location and retrieving goods is repeated until all boxes on the cart are full. The subject takes the cart back to the base and places the full boxes on the conveyor. The order-picking process then starts anew. When all boxes in the aisles are empty, the order-picking process ends and the research team refills the boxes.

3.2.2. Logistics Scenario 2—Real-World Order Picking and Consolidation System

The second scenario is based on a real warehouse. Access to the site and process documentation was granted by industry partners of the chair of Materials Handling and Warehousing. In contrast to Scenario 1, the second scenario takes information technology processes such as scanning barcode labels or pushing buttons for pick confirmation into account. For the sake of clarity, the order-picking process and the consolidation process of the picked goods are illustrated separately in Figure 3 and Figure 4, respectively. The physical laboratory set-up of Scenario 2 is illustrated in Figure 5.
Figure 3

Business process model of logistics Scenario 2 (Part 1)—real warehouse order picking.

Figure 4

Business process model of logistics Scenario 2 (Part 2)—real warehouse order picking.

Figure 5

Physical laboratory set-up of logistics Scenario 2—real warehouse order picking.

The order-picking cart is bigger than the one used in the first scenario, as visible in Figure 2 and Figure 5. It has three shelves of equal size that are filled with cardboard boxes of different shapes and sizes. Each box is held open with a rubber band. In the real warehouse, a so-called put-to-light (PtL) frame is attached to the cart. It gives a visual signal where to place articles and has buttons to press for retrieval and submission confirmation. Small calculators are attached to the cart to replicate this system in the laboratory. On its shorter end, the cart has two handles, a small screen, a stamp pad, a plastic bag for packaging waste, and a second bag, which is filled with more small plastic bags. Apart from the screens, all items could be purchased. A labeled cardboard replica substitutes the screen. The research group gives the subject the information that is usually displayed on the screens, for example, the retrieval location or the picking quantity. Subjects deploy a stamp and a knife, which are attached to the OMoCap suit. Additionally, subjects operate a handheld scanner, which is attachable to the cart. To ensure a natural motion of the subjects when using the scanner, all items have barcode labels that need to be scanned. Thus, the subjects have to use the scanner correctly to trigger an acoustic signal that confirms a scan operation. An order consists of several items that need to be picked in varying quantities. During each order-picking cycle, one cart serves the orders of several customers at the same time. This is referred to as order batching. The articles are household goods of varying dimensions and weights, such as cutlery, dishes, or storage boxes. They are stored in plastic and cardboard boxes and open-lid bins. Some of the cardboard boxes were sealed with tape to protect the goods. These storage units are placed on shelves of different heights or on the ground. Stacking frames and shelves formed two aisles.
In the real-world system, a flow-through rack is deployed for goods consolidation. In the laboratory, pipe-racking systems were used to recreate it. Each chute of the flow-through rack is equipped with a barcode label and a human-readable ID. In general, the subject scans all labeled units to ensure that the correct article is picked, e.g., a single article or a newly labeled plastic bag. There are three cases for scanning an article's barcode label. In the first case, the articles are individually packed, and every article already has a barcode label attached. Second, some articles are in secondary packaging, e.g., a cardboard box or a plastic bag, that needs to be opened before retrieval. The articles in this secondary packaging have individual barcodes. Third, some articles do not have an attached barcode label. In this case, the barcode at the shelf has to be scanned. A barcode label roll is provided next to the respective articles, and these labels need to be attached to the retrieval unit. To begin the order-picking process, the subjects scan the barcode of the cart to trigger the order-picking mode. The screen shows the next retrieval location. When they arrive there, they scan the article's barcode label, which may be found on the article or on the shelf, as explained previously. If the article is correct, the screen indicates the correct withdrawal quantity. Next, the subject retrieves the correct number of articles. If necessary, they open sealed cardboard boxes with the knife. They dispose of packaging waste using the plastic bag at the cart. If the article already has a barcode label, the subject can scan it so that the PtL frame visually indicates the correct box for submitting the articles. For articles that do not have a barcode label, the subject wraps the desired quantity of articles in a plastic bag and seals it with a barcode label provided at the shelf. Pressing a button confirms each submission into a box on the cart.
The button is on the PtL frame above the box. If this is the first item in a box, the box must be marked with a stamp. This is a quality-assurance measure to trace back the employee who packed the box. The subject takes the cart from one retrieval location to the next until the order is complete. The order-picking process is followed by the consolidation of the packed goods in preparation for dispatch. For consolidation, the boxes must be inserted at the back side of a flow-through rack. On the front side, the packaging workplaces are located, where dispatch preparation takes place. As with the order-picking mode, the subject scans a specific barcode on the cart to trigger the consolidation mode. Next, they take the cart to the consolidation point, which is shown on the cart's display. The subjects scan the barcode of a box so that the scanner's display shows the correct chute. After they insert the box, they scan the barcode label at the chute to confirm the submission. This procedure repeats until there are no more boxes on the cart.

3.2.3. Logistics Scenario 3—Real-World Packaging Process

The third scenario is the packaging process that follows the order picking and consolidation of Scenario 2 in the same real-world warehouse. The packaging process serves the dispatch preparation of the picked articles. In general, the consignment size per order does not exceed 5 boxes; thus, shipping by pallet is not feasible. The real-world packaging process is illustrated in Figure 6. Its physical laboratory set-up can be observed in Figure 7.
Figure 6

Business process model of logistics Scenario 3—real warehouse packaging work station.

Figure 7

Physical laboratory set-up of logistics Scenario 3—real warehouse packaging work station.

Each packaging workplace is equipped with a computer, a printer, a bubble wrap dispenser, a tape dispenser, a scale for weighing boxes, and a trash bin. Next to the table is a conveyor, where all boxes that are ready for shipment have to be placed. The packaging table in the laboratory is a model often found in real warehouses. Further tables were placed next to the packaging table to provide space for the equipment. The table on the far left was used to recreate the surface of the conveyor. When a box was pushed onto the surface, a researcher took the box. The actual motion of a conveyor is not necessary to ensure a human motion that is close to reality. The dimensions of the tables in the laboratory closely resemble the table from the real-world warehouse. For the tools, equipment similar to the real-world system was purchased. The bubble wrap dispenser was recreated by cutting a small opening in a cardboard box. The wrap was refilled manually by the researchers present during the recordings. A fully functional computer was placed on the table, with a mouse and keyboard attached and a spreadsheet application running. When computer work was necessary, the subjects were tasked to perform basic tasks in the program. The printers were substituted by a researcher handing the printed items to the subject. As the weight scale is an area on the table's surface, it could be recreated by indicating a certain area with colored stripes. As explained previously, all boxes to be prepared for shipment were stored in a flow-through rack. During the recordings, the rack from Scenario 2 was used and moved next to the packaging table. As the second and third scenarios were recorded in immediate succession, the flow-through rack was already filled with boxes containing articles. At the beginning of the packaging process, the subject goes to the computer and chooses a packing order.
Next, they take all boxes that belong to one order from the flow-through rack and place them on the packing table. The rubber band of each box is removed and the barcode is scanned with the hand scanner; when doing so, the packing list of the order is printed automatically. For each box, the subject evaluates its filling level to decide whether repacking is necessary. This is the case when the box is either rather empty or overfull. In the first case, more articles from a different box of the same order are added; when the filling level is low, the contents of several boxes are combined, and when a box is removed from the order, this information must be entered into the computer. In the second case, the articles protrude from the box, possibly because the articles are bigger than the box due to incorrect article master-data; the contents of an overfilled box are put into a bigger one. The subject can get boxes of different sizes from a storage next to the packing table. When repacking articles from one box to another, each article needs to be scanned and the repacking must be confirmed at the computer. The subject then confirms that all boxes of an order are filled properly. In case the packing list has been altered due to repacking, it is reprinted automatically. Next, the subject puts the packing list into each box and fills the boxes up with bubble wrap. Then, each box must be pushed onto the scale, and the subject triggers the weighing process at the computer. The system checks whether the actual weight of the box corresponds to the expected weight according to the master data and the packing list. Once all boxes are packed correctly and their weight has been approved, the subject seals them using the tape dispenser. The printer automatically prints the shipping labels when all boxes of one order are ready to be sealed, and the subject applies a label to each box. Eventually, each box is pushed onto the conveyor surface.

3.3. Configuration of Sensors and Markers

The OMoCap system tracked 39 reflective markers attached to a suit, see Figure 8. The VICON system consists of 38 infrared cameras recording at a sampling rate of 200 fps. Three different sets of on-body devices, or IMUs, were used, see Figure 9. IMU-sets 1 and 3 served as a proof of concept and are not part of the dataset. The six IMUs of the second set, from MbientLab [29], are attached to the arms, legs, chest, and waist. They record tri-axial linear acceleration and angular velocity at a rate of 100 Hz.
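Because the OMoCap system records at 200 fps while the MbientLab IMUs record at 100 Hz, the two streams must be brought onto a common time axis before labels can be transferred between them. A minimal sketch using linear interpolation (the resampling approach itself is an assumption for illustration, not a procedure prescribed by the dataset):

```python
import numpy as np

def resample_to_rate(signal, src_rate, dst_rate):
    """Linearly interpolate a (T, D) multichannel signal from src_rate to dst_rate."""
    signal = np.asarray(signal, dtype=float)
    t_src = np.arange(signal.shape[0]) / src_rate        # original time stamps [s]
    n_out = int(round(t_src[-1] * dst_rate)) + 1         # samples covering the same duration
    t_dst = np.arange(n_out) / dst_rate                  # target time stamps [s]
    return np.stack([np.interp(t_dst, t_src, signal[:, d])
                     for d in range(signal.shape[1])], axis=1)

# Upsample a 2-minute, 6-channel IMU stream from 100 Hz to the 200 fps OMoCap rate.
imu = np.random.randn(100 * 120, 6)
imu_200 = resample_to_rate(imu, 100, 200)
```

Any other alignment scheme (e.g., nearest-neighbor sampling of labels) would serve equally well; linear interpolation simply keeps the channel values smooth.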
Figure 8

Marker positions on an Optical marker-based Motion Capture (OMoCap) suit.

Figure 9

Positions of on-body devices (inertial measurement unit (IMU)) from set 1 (Texas Instruments Incorporated), set 2 (MbientLab), and set 3 (MotionMiners GmbH).

3.4. Characteristics of Participating Subjects

A total of 14 subjects (S) were involved in the recording process. Their characteristics, including sex, age, weight, height, and handedness, are listed in Table 1. Examining the minimum and maximum of these characteristics shows that a wide spectrum of physical characteristics is present; thus, the subjects' motion patterns vary widely. In addition, the ratio of left-handed to right-handed subjects closely resembles that of the general population [30,31].
Table 1

Subject: specifications and scenario assignment.

ID   | Sex [F/M] | Age [year] | Weight [kg] | Height [cm] | Hand. [L/R] | OMoCap | IMU-Set | Scen. 1 | Scen. 2 | Scen. 3
S01  | M | 28 | 78  | 175 | L | x | 1    | 29 | 0  | 0
S02  | F | 24 | 62  | 163 | L | x | 1    | 30 | 0  | 0
S03  | M | 59 | 71  | 171 | R | x | 1    | 27 | 0  | 0
S04  | F | 53 | 64  | 165 | L | x | 1    | 29 | 0  | 0
S05  | M | 28 | 79  | 185 | R | x | 1    | 26 | 0  | 0
S06  | F | 22 | 52  | 163 | R | x | 1    | 30 | 0  | 0
S07  | M | 23 | 65  | 177 | R | x | 2, 3 | 2  | 13 | 14
S08  | F | 51 | 68  | 168 | R | x | 2, 3 | 2  | 13 | 14
S09  | M | 35 | 100 | 172 | R | x | 2, 3 | 2  | 14 | 13
S10  | M | 49 | 97  | 181 | R | x | 2, 3 | 2  | 13 | 12
S11  | F | 47 | 66  | 175 | R | x | 2, 3 | 2  | 12 | 0
S12  | F | 23 | 48  | 163 | R | x | 2, 3 | 0  | 6  | 14
S13  | F | 25 | 54  | 163 | R | x | 2, 3 | 2  | 14 | 14
S14  | M | 54 | 90  | 177 | R | x | 2, 3 | 2  | 14 | 14
Min. |   | 22 | 48  | 163 |
Avg. |   | 37 | 71  | 171 |
Max. |   | 59 | 100 | 185 |
Sum  |   |    |     |     |   |   |      | 185 | 99 | 95
The scenario columns give the number of two-minute recordings per subject.
All subjects were scheduled for a total of 30 recordings of 2 min each, which corresponds to about 30 × 2 × 14 = 840 min of recorded material. In Scenario 1, subjects 1 to 6 performed 30 recordings wearing the OMoCap suit and IMU-set 1. Subjects 7 to 14, wearing the OMoCap suit and IMU-sets 2 and 3, participated in 2 recordings in Scenario 1, 14 recordings in Scenario 2, and 14 packing recordings in Scenario 3. Due to heavy noise and issues with the sensor readings, some recordings had to be scrapped and are not included in the dataset; thus, the number of recordings per subject in Table 1 deviates from this schedule. A total of 379 recordings (758 min) were annotated and are included in the dataset. Figure 10 shows the varying physical features of all subjects true-to-scale.
Figure 10

Subjects before the recordings.

3.5. Recording Procedure

The LARa dataset was recorded in 7 sessions. In the first 3 sessions, subjects 1 to 6 went through Scenario 1. In sessions 4 to 7, data were recorded in all three scenarios with subjects 7 to 14.

3.5.1. Preliminaries

Before the recording, each subject was measured according to the information necessary for the VICON Nexus software: body mass, height, leg length, knee width, ankle width, shoulder offset, elbow width, wrist width, and hand thickness. Subsequently, the test subjects were equipped with an OMoCap suit, a headband, and work safety shoes, as used in real warehouses. Markers and IMUs were attached to the suit. To document the proper positioning of all markers and IMUs, each subject was photographed from four sides before the recording.

3.5.2. Recording Process

To record realistic motions, the subjects were introduced to the scenarios by a domain expert in advance. Test runs were carried out before the recordings commenced, and the subjects were allowed to familiarize themselves with the processes and objects. The subjects do not perform individual, isolated movements as in other datasets that originate from laboratories, e.g., [32]; rather, realistic motion sequences were the goal. To achieve this, the subjects were only instructed about their tasks within a scenario. They were not told how to perform the specific motions necessary to fulfill their task. Thus, the way they handled items, picked boxes, and moved to a location was not influenced by the researchers; the motion is solely determined by each subject's individual preference. In addition, the subjects were not given detailed information about the underlying research goal to avoid a bias in their motion behavior. Between each recording unit of two minutes, only a break of a few seconds was necessary to start the next capture, so the subjects could remain focused on the task. The subjects did not take off the suit between recordings. After the recordings concluded, each subject was photographed again from four sides to verify the proper positioning of the markers and sensors.

3.5.3. Documentation and Protocol

A protocol was kept before, during, and after the recordings for each subject to ensure repeatability of the recording sessions: time, size of the suit and shoes, room temperature, the use of velcro to fit the suit to the person, the RGB video files that were created, the number and descriptions of photos taken, remarks, and incidents. The expenditure of time for recording is made up of the OMoCap system's calibration, the preparation of each subject, their introduction to the scenarios, and the recordings themselves. In total, the expenditure was over 197 person-hours (PHR) to record 14 h of material. To support the subsequent annotation of the data, the sessions were captured by an RGB camera. In sessions 1 to 3, only occasional recordings were created with the RGB camera, but at least one recording per subject is available. Due to the increasing complexity of human motion and the increasing spectrum of objects in Scenarios 2 and 3, the subjects 7 to 14 were captured entirely by a camera to ensure that the performed activities are apparent to the annotators. In addition to the 14 subjects, the RGB camera recorded other people who were in the test field at the same time. They provided guidance when the task was unclear, ensured that none of the markers or sensors detached, and continuously maintained the experimental setup, e.g., by refilling the shelves with packed goods. In addition, photos taken before and after the recordings are included in the protocol. The Remarks section of the protocol includes the number and time of the breaks taken by the subjects, re-calibrations of the OMoCap system during the session, injuries of the subjects, and unusual movements during recording, e.g., drinking. Incidents mainly include lost or shifted markers and sensors. If a loss was observed during a recording, the recording was aborted, deleted, and restarted from the beginning.
In three instances, a detachment was noticed only after the recording session. Incidents with respect to S11: after recording 27, it was noticed that the marker of the left finger (see Figure 8, marker number 22) was misplaced; the research group could not determine when exactly the marker shifted its position. After recording 30, it was noticed that the marker of the right ankle (see Figure 8, marker number 35) was lost. Incidents with respect to S13: after the last recording (number 30), it was noticed that the marker of the right finger (see Figure 8, marker number 23) and the marker of the left wrist (see Figure 8, marker number 18) were missing; one of the lost markers was found on the left side of the subject's chest. Incidents with respect to S14: after recording number 15, it was noticed that the marker of the right forearm (see Figure 8, marker number 17) was stuck to the leg. For the subsequent recordings (numbers 16 to 30), the marker was put back in its proper position. Despite these incidents, the data acquired through these recordings were found to be usable.

3.6. Classes and Attributes

This subsection explains the definitions of human activities in the dataset. The dataset considers periodic and static activities, following [1]. The dataset contains annotations of semantic coarse-descriptions of the activities. These semantic definitions are called attributes, and they are motivated by the HAR methods in [13,16]. An attribute representation can be seen as an intermediate binary mapping between sequential data and human activities. This intermediate mapping is beneficial for solving HAR problems because it allows sharing high-level concepts among the activity classes, so the consequences of the unbalanced-class problem can be reduced. A dataset for HAR contains a set of N sequential samples X_n ∈ R^(T×D) — for the LARa dataset either the OMoCap or the IMU recordings — where T is the sequence length and D is the number of joints or sensors per dimension, also addressed as the number of sequence channels, together with their respective activity classes y_n from a set of C activity classes. Following the method in [16], this dataset additionally provides attribute annotations a_n, where each a_n is drawn from an attribute representation A. A is a binary attribute representation of size M × K, with M the number of attribute representations of size K over all activity classes. A single attribute representation a serves as an intermediate representation between an input signal X and the expected activity class y, i.e., X → a → y. There are M different attribute representations. This is different from [16], where the authors assign a single, random attribute representation to an activity class. In this work, the number M of representations is determined after the annotation process, in which a set of attributes is assigned to short windows of the recordings according to the human movements. Table 7 shows the number of different attribute representations per activity class in LARa. The definition of the activities and their semantic attributes is derived from the researchers' experience [13,33] and from HAR methods [1,16].
The terminology of the attributes and activities by default implies an industrial context. This excludes activities irrelevant for warehousing, such as smoking or preparing coffee. This is referred to as a Closed-World Condition [34].

3.6.1. Activity Classes

There are eight activity classes, see Table 2. Standing, Walking and Cart emphasize the subject’s locomotion. The Handling activities refer to a motion of the arms and hands when manipulating an article, box, or tool. These activities do not consider holding an element while standing or walking. Synchronization is crucial for proper annotation and for transferring the labels to different sensor streams.
Table 2

Activity Classes and their semantic meaning.

Activity Class | Description
c1 Standing | The subject is standing still on the ground or performs smaller steps. The subject can hold something in their hands or stand hands-free.
c2 Walking | The subject performs a gait cycle [35] (pp. 3–7) while carrying something, or the subject is walking hands-free. The only exception is made with regard to the cart (see below).
c3 Cart | The subject is walking (gait cycle) with the cart to a new position. This class does not include the handling of items on the cart, like putting boxes or retrieving items. Likewise, the handling of the cart itself, e.g., turning it to better reach its handles, is not included.
c4 Handling (upwards) | At least one hand reaches shoulder height (80% of a person's total height [36] (p. 146)) or is lifted beyond that during the handling activity.
c5 Handling (centred) | Handling is possible without bending over, kneeling, or lifting the arms to shoulder joint height.
c6 Handling (downwards) | The hands are below the height of the knees (lower than 30% of a person's total height [36] (p. 146)). The subject's spine is horizontal or they are kneeling.
c7 Synchronization | Waving motion where both hands are above the subject's head at the beginning of each recording.
c8 None | Excerpts that shall not be taken into account because the class is not recognisable. Reasons are errors or gaps in the recording or a sudden cut at the end of a recording unit.

3.6.2. Attributes

There are K = 19 attributes. These are coarse semantic descriptions of the activities. They are mostly related to the locomotion and to the pose when moving; the human pose changes according to the handling of different elements and to different heights. The attributes are subdivided into five groups, see Table 3 and Figure 11.
Table 3

Attributes and their semantic meaning.

Attribute | Description
I - Legs
A Gait Cycle | The subject performs a gait cycle [35] (pp. 3–7).
B Step | A single step where the feet leave the ground without a foot swing [35] (pp. 3–7). This can also refer to a step forward followed by a step backwards using the same foot.
C Standing Still | Both feet stay on the ground.
II - Upper Body
A Upwards | At least one hand reaches shoulder height (80% of a person's total height [36] (p. 146)) or is lifted beyond that during the handling activity.
B Centred | Handling is possible without bending over, kneeling, or lifting the arms to shoulder joint height.
C Downwards | The hands are below the height of the knees (lower than 30% of a person's total height [36] (p. 146)). The subject's spine is horizontal or they are kneeling.
D No Intentional Motion | Default value when no intentional motion is performed, e.g., when standing without doing anything, carrying a box, or walking with a cart. This is because there is no intentional upper-body motion when performing these activities, only a steady stance.
E Torso Rotation | Rotation in the transverse plane [37] (pp. 2–3). Either a rotating motion, e.g., when taking something from the cart and turning towards the shelf, or a fixed position when handling something while the torso is rotated.
III - Handedness
A Right Hand | The subject handles or holds something using the right hand.
B Left Hand | The subject handles or holds something using the left hand.
C No Hand | Hands are not used, neither for holding nor for handling something.
IV - Item Pose
A Bulky Unit | Items that the subject cannot put their hands around, e.g., boxes.
B Handy Unit | Items that can be carried with a single hand or that the subjects can put their hands around, e.g., small articles, plastic bags.
C Utility Auxiliary | Use of equipment, e.g., scissors, knives, bubble wrap, stamps, labels, scanners, packaging tape dispensers, adhesives, etc.
D Cart | Either bringing the cart into proper position before taking it to a different location (Handling) or walking with the cart to a new location (No Intentional Motion).
E Computer | Using mouse and keyboard.
F No Item | Activities that do not include any item, e.g., when the subject fumbles for something when searching for a specific item.
V - None
A None | Equivalent to the None class.
Figure 11

Semantic attributes.

During the labeling, annotators follow these rules: at least one attribute per group must be assigned. In group I, the attributes are disjoint, since a subject performs only one of these motions at a time. The attributes A–D of group II are disjoint, while the Torso Rotation attribute is independent of them. In the third group, the choice between right and left is non-exclusive, as one can use both arms at the same time. In group IV, the attributes are disjoint; annotators give priority to the items according to a fixed hierarchy. The None and the Synchronization classes have a fixed attribute representation, as the execution of the waving motion for synchronizing is predefined.
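The annotation rules above can be expressed as a simple consistency check. The sketch below validates the 17-bit layout over groups I–IV used in Table 4 (the bit layout and the omission of group V are our assumptions for illustration):

```python
# Validate a 17-bit attribute vector over groups I-IV, following the annotation
# rules of Section 3.6.2. The exact bit layout is an assumption based on Table 4.
GROUPS = {
    "I Legs":         slice(0, 3),    # A Gait Cycle, B Step, C Standing Still (disjoint)
    "II Upper Body":  slice(3, 8),    # A-D disjoint, E Torso Rotation independent
    "III Handedness": slice(8, 11),   # A Right, B Left, C No Hand
    "IV Item Pose":   slice(11, 17),  # A-F disjoint
}

def is_valid(bits):
    if len(bits) != 17:
        return False
    legs, upper, hands, items = (bits[GROUPS[g]] for g in GROUPS)
    if sum(legs) != 1:            # group I: exactly one locomotion attribute
        return False
    if sum(upper[:4]) != 1:       # group II: A-D disjoint; E (Torso Rotation) is free
        return False
    if sum(items) != 1:           # group IV: disjoint
        return False
    if hands[2] == 1 and (hands[0] or hands[1]):
        return False              # "No Hand" excludes right/left hand
    return sum(hands) >= 1        # at least one attribute per group
```

Right and left hand may co-occur, mirroring the non-exclusive choice in group III.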

3.6.3. Exemplary Activity Sequence and Its Proper Annotation

Table 4 shows an exemplary warehousing process that consists of four process steps. This process is an excerpt from Scenario 2. In the first process step, the subject is initially standing (Act. 1) before walking to the cart without holding anything in hands (Act. 2). Then, the cart is brought into proper position with both hands while performing smaller steps (Act. 3) and the subject pulls the cart to the retrieval location using the right hand (Act. 4).
Table 4

Exemplary picking process broken down into process steps, activities, classes, and attributes.

Attribute representation columns: I Legs (A Gait Cycle, B Step, C Standing Still); II Upper Body (A Upwards, B Centred, C Downwards, D No Intentional Motion, E Torso Rotation); III Handedness (A Right Hand, B Left Hand, C No Hand); IV Item Pose (A Bulky Unit, B Handy Unit, C Utility/Auxiliary, D Cart, E Computer, F No Item).

Process Step                     | Act. | Class            | I ABC | II ABCDE | III ABC | IV ABCDEF
1 Bring cart to retrieval        |  1   | c1 Standing      |  001  |  00010   |   001   |  000001
  location                       |  2   | c2 Walking       |  100  |  00010   |   001   |  000001
                                 |  3   | c5 Hand. (cen.)  |  010  |  01000   |   110   |  000100
                                 |  4   | c3 Cart          |  100  |  00010   |   100   |  000100
2 Scan barcode                   |  5   | c1 Standing      |  001  |  00010   |   100   |  000100
                                 |  6   | c5 Hand. (cen.)  |  010  |  01000   |   010   |  010000
                                 |  7   | c5 Hand. (cen.)  |  001  |  01000   |   010   |  010000
                                 |  8   | c4 Hand. (upw.)  |  001  |  10001   |   010   |  001000
                                 |  9   | c5 Hand. (cen.)  |  010  |  01000   |   010   |  010000
3 Retrieve item and put in box   | 10   | c4 Hand. (upw.)  |  010  |  10000   |   100   |  010000
                                 | 11   | c4 Hand. (upw.)  |  001  |  10000   |   100   |  010000
                                 | 12   | c4 Hand. (upw.)  |  010  |  10000   |   100   |  010000
                                 | 13   | c6 Hand. (down.) |  010  |  00100   |   110   |  010000
                                 | 14   | c6 Hand. (down.) |  001  |  00100   |   110   |  010000
4 Confirm pick                   | 15   | c6 Hand. (down.) |  001  |  00100   |   100   |  001000
At the beginning of process step 2, the subject is standing while resting the right hand on the cart's handle (Act. 5). Then the subject proceeds to take the scanner from the cart. The first half of this left-handed handling motion is performed while taking a step (Act. 6), while the latter half is performed while standing with both feet on the ground (Act. 7). It is important to note that the scanner is annotated as a Handy Unit here because it is handled as such; in contrast, using it in the following activity is annotated with Utility Auxiliary. The label is located on the subject's right at eye level, so a Torso Rotation is necessary and the handling is performed upwards (Act. 8). The ninth activity refers to the subject mounting the scanner back on the cart (Act. 9). In the third process step, the subject picks the item from the shelf (Act. 10–12) and places it in a box located on the lowest level of the cart (Act. 13 and 14). Finally, the pick is confirmed by clicking the put-to-light button located above the box (Act. 15). There is a wide variety of activity sequences that may constitute the same process. For example, different subjects use different hands when handling an element. In addition, their body motions when lifting something from the same height differ depending on their body size. Thus, the exemplary sequence of activities in Table 4, its class labels, and its attribute representations are one of many viable options.
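Each row of Table 4 encodes one activity as a 17-bit attribute vector. A small helper makes such a row human-readable (the column-order mapping to attribute names is our assumption, derived from Tables 3 and 4, not an official mapping shipped with LARa):

```python
# Attribute names in the assumed column order of Table 4 (groups I-IV).
ATTRIBUTES = [
    "Gait Cycle", "Step", "Standing Still",                    # I Legs
    "Upwards", "Centred", "Downwards",
    "No Intentional Motion", "Torso Rotation",                 # II Upper Body
    "Right Hand", "Left Hand", "No Hand",                      # III Handedness
    "Bulky Unit", "Handy Unit", "Utility/Auxiliary",
    "Cart", "Computer", "No Item",                             # IV Item Pose
]

def decode(bitstring):
    """Turn a 17-character bit string from Table 4 into the active attribute names."""
    return [name for name, bit in zip(ATTRIBUTES, bitstring) if bit == "1"]

# Act. 1 of the exemplary process (c1 Standing):
print(decode("00100010001000001"))
# Act. 8 (c4 Handling upwards, scanning the label):
print(decode("00110001010001000"))
```

Decoding Act. 1 yields exactly the attributes discussed in the text: the subject stands still, with no intentional upper-body motion, no hand in use, and no item.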

3.7. Annotation and Revision

A Python tool was created for annotating the OMoCap data, see Figure A3. The procedure of the annotation and revision is described by Reining et al. [38]. The annotation tool offers a visualization of the skeleton from the OMoCap data and a window-based annotation frame. A window is a segment that is extracted from the sequential data. In the annotation process, an annotator provides the activity class and the attribute representation of a window. Window sizes are variable; the annotator consequently selects the size of each window. Twelve annotators labeled the OMoCap data of the 14 subjects. Apart from two annotators, none of them had any prior experience regarding the annotation of OMoCap data. Each annotator followed the guidelines mentioned in Section 3.6. RGB videos served as an additional aid for complex activities.
Figure A3

Screenshot of the annotation and revision tool during the annotation.

The total time effort for annotation comprised over 474 person-hours (PHR). Table 5 illustrates the annotation effort per individual annotator. The information given in the table relates to two-minute recordings. With a range of 39 min to almost 3 h of annotation per recording, the annotators differ greatly in their annotation speed. The reasons for the different annotation speeds are the different levels of experience of the annotators, the different settings of the window sizes of activities, and the individually selectable playback speed of the OMoCap recordings in the annotation tool. On average, about 37.5 min were required to annotate one minute of recorded material.
Table 5

Annotation effort of all annotators.

ID   | Total Time [hh:mm:ss] | No. of Rec. | Time per Rec. [hh:mm:ss]
A01  | 55:12:19  | 52  | 01:14:02
A02  | 73:22:04  | 45  | 01:55:21
A03  | 56:30:39  | 54  | 01:14:13
A04  | 34:39:08  | 26  | 01:28:00
A05  | 84:18:37  | 30  | 02:48:37
A06  | 39:24:16  | 64  | 00:39:46
A07  | 28:40:57  | 25  | 01:10:35
A08  | 32:56:40  | 27  | 01:15:24
A09  | 33:28:45  | 27  | 01:14:24
A10  | 10:14:21  | 12  | 00:51:12
A11  | 23:03:16  | 14  | 01:38:48
A12  | 02:16:00  | 3   | 01:45:03
Min. |           |     | 00:39:46
Max. |           |     | 02:48:37
Sum  | 474:07:02 | 379 |
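The average annotation effort per recorded minute follows directly from the totals in Table 5 (474:07:02 of annotation for 379 two-minute recordings):

```python
# Average annotation effort per recorded minute, from the totals in Table 5.
total_h, total_m, total_s = 474, 7, 2      # summed annotation time, 474:07:02
recorded_minutes = 379 * 2                 # 379 two-minute recordings = 758 min

total_annotation_min = total_h * 60 + total_m + total_s / 60
per_recorded_minute = total_annotation_min / recorded_minutes
print(round(per_recorded_minute, 1))       # annotation minutes per recorded minute
```

This works out to roughly 37.5 min of annotation per minute of material.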
Following the annotation, the data were revised by four domain experts, see Table 6. The revision of an annotated two-minute recording varied between 4 and 121 min, depending on the quality of the annotation. Compared to the annotation, the average time for a revision is significantly lower, at about 11.3 min per minute of recorded material.
Table 6

Revision effort of all revisers.

ID   | Total Time [hh:mm:ss] | No. of Rec. | Time per Rec. [hh:mm:ss]
Re01 | 13:44:00  | 88  | 00:09:22
Re02 | 39:18:00  | 97  | 00:24:19
Re03 | 28:37:00  | 91  | 00:18:52
Re04 | 61:19:00  | 103 | 00:35:43
Min. |           |     | 00:09:22
Max. |           |     | 00:35:43
Sum  | 142:58:00 | 379 |
The dataset is unbalanced. The Handling classes represent nearly 60% of the recordings. These classes also show a higher variability of their attribute representations; this means that these classes show up in many different forms. The class Handling (centred) is the most frequent activity by far. The representations of the Walking activity class differ only with regard to the handedness and the item pose, because the Gait Cycle and the No Intentional Motion attributes are fixed. The third class, Cart, can only have three representations: the cart is pushed or pulled using the Left Hand, the Right Hand, or both hands while walking. By definition, there is only one valid representation each for the Synchronization and None classes. This is reflected in the results of the annotation and revision, see Table 7.
Table 7

Annotation results divided by activity classes.

Class              | Stand.  | Walk.   | Cart      | Hand. (up.) | Hand. (cent.) | Hand. (down.) | Sync.   | None
Samples            | 974,611 | 994,880 | 1,185,788 | 754,807     | 3,901,899     | 673,655       | 158,655 | 403,737
Avg. Time/Occ. [s] | 1.71    | 3.72    | 6.46      | 2.72        | 4.39          | 2.74          | 2.16    | 7.10
Proportion [%]     | 10.77   | 11.00   | 13.11     | 8.34        | 43.12         | 7.45          | 1.75    | 4.46
[M] No. of attr. representations | 28 | 7 | 3 | 45 | 72 | 47 | 1 | 1
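The class proportions in Table 7 can be reproduced from the raw sample counts, which also makes the imbalance explicit:

```python
# Reproduce the class proportions of Table 7 from the sample counts.
samples = {
    "Standing": 974_611, "Walking": 994_880, "Cart": 1_185_788,
    "Handling (upwards)": 754_807, "Handling (centred)": 3_901_899,
    "Handling (downwards)": 673_655, "Synchronization": 158_655, "None": 403_737,
}
total = sum(samples.values())
proportions = {c: round(100 * n / total, 2) for c, n in samples.items()}
print(proportions["Handling (centred)"])   # the most frequent class by far
```

Such per-class frequencies are also what one would feed into class weighting or resampling schemes to counter the imbalance.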

3.8. Folder Overview of the LARa Dataset

LARa contains data of an OMoCap system, one IMU-set, and one RGB camera as well as the recording protocol, the tool for annotation and revision and the networks of activity classes and attributes. Table 8 illustrates an overview of the sizes of the folders and the formats of the files.
Table 8

Folder overview of the LARa  dataset.

Folder                       | Folder Size [MiB] | File Format | Recording Rate
OMoCap data                  | 33,774    | csv | 200 fps
IMU data - MbientLab         | 1355.77   | csv | 100 Hz
RGB videos                   | 17,974.82 | mp4 | 30 fps
recording protocol           | 2.58      | pdf | -
annotation and revision tool | 2899.99   | py  | -
class_network                | 1449.55   | pt  | -
attrib_network               | 1449.55   | pt  | -
The files of the OMoCap data, IMU data, and RGB videos are named after the logistics scenario, the subject, and the recording; a file name thus encodes, for example, logistics scenario 01, subject 02, recording 12.

4. Deploying LARa for HAR

The tCNN proposed in [18] was deployed for solving HAR using the LARa dataset; minor changes to the architecture are proposed here. Our tCNN contains four convolutional layers, no downsampling operations, and three fully-connected layers. Downsampling operations are not deployed, as they affect the performance of the network negatively, following the conclusions of [16]. The convolutional layers are composed of 64 filters each, which perform convolutions along the time axis. The first and second fully-connected layers contain 128 units. Considering the definitions in Section 3.6, there are two alternative final fully-connected layers, depending on the task. A softmax layer is used for the direct classification of the activity classes, with one unit per class. A fully-connected layer with a sigmoid activation function is used for computing attributes; this layer contains 19 units. The number of output units corresponds to either the number of classes or the number of attributes, respectively, see Section 3.6. Figure 12 shows the tCNN's architecture.
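The structure described above can be sketched in PyTorch. Only the overall layout — four temporal convolutions with 64 filters and no downsampling, two fully-connected layers of 128 units, and a task-dependent output head — follows the text; the filter length of 5 along the time axis is an assumption for illustration, since the exact value is not restated here:

```python
import torch
import torch.nn as nn

class TCNN(nn.Module):
    """Sketch of the tCNN: four temporal conv layers (64 filters, no downsampling),
    two FC layers of 128 units, and either a class or an attribute output head.
    The filter length of 5 along the time axis is an assumption."""
    def __init__(self, T, D, n_out, attributes=False, filter_len=5):
        super().__init__()
        self.conv = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(1 if i == 0 else 64, 64, (filter_len, 1)),
                            nn.ReLU()) for i in range(4)]
        )
        t_out = T - 4 * (filter_len - 1)            # no padding, stride 1, no pooling
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * t_out * D, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_out),
        )
        # sigmoid head for attributes; for classes, raw logits go into a
        # cross-entropy loss (which applies the softmax internally).
        self.head = nn.Sigmoid() if attributes else nn.Identity()

    def forward(self, x):                            # x: (batch, 1, T, D)
        return self.head(self.fc(self.conv(x)))

# Attribute head: 19 sigmoid outputs for a window of T=50 frames, D=12 channels.
model = TCNN(T=50, D=12, n_out=19, attributes=True)
out = model(torch.randn(8, 1, 50, 12))
```

The small T and D in the demo keep the flattened fully-connected layer manageable; for the full OMoCap input, D would be 126.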
Figure 12

The Temporal Convolutional Neural Network (tCNN) architecture contains four convolutional layers. According to the classification task, there are two types of final fully-connected layer: a softmax and a sigmoid layer.

The architecture processes sequence segments that consist of a feature-map input of size T × D, with T the sequence length and D the number of sequence channels. The sequence segments are extracted following a sliding-window approach with a fixed window and step size, i.e., with overlapping windows. The number of sequence channels D is 126, as there are measurements of position and rotation in R^3 for each of the 21 joints of the LARa OMoCap dataset (21 × (3 + 3) = 126). This excludes the joint "lower_back", as it is used for normalizing the human poses with respect to the subject. The tCNN computes either an activity class or a binary attribute representation from an input sequence. Predicting attribute representations follows the method in [16]. Differently from a standard tCNN, this architecture contains a sigmoid activation function replacing the softmax layer. The sigmoid activation function is computed as sigmoid(x) = 1/(1 + e^(−x)) and is applied to each element of the output layer. The output can be considered as binary pseudo-probabilities for each attribute being present or not in the input sequence. The architecture is trained using the binary cross-entropy loss, given by BCE(a, â) = −(1/K) Σ_{k=1..K} [a_k log â_k + (1 − a_k) log(1 − â_k)], with a the target attribute representation and â the output of the architecture. Following [3,12], input sequences are normalized per sensor channel. Additionally, Gaussian noise is added; this noise simulates the sensors' inaccuracies. Following the training procedures from [1,2], the LARa OMoCap data are divided into three sets — training, validation, and testing — by subject. An early-stopping approach is followed using the validation set; this set is also deployed for finding proper training hyperparameters. Recordings with the label None are not considered for training, following the procedure in [3].
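The sliding-window segmentation and per-channel normalization described above can be sketched as follows. The window size, step size, and the [0, 1] target range are placeholders, since the exact values are not restated here:

```python
import numpy as np

def sliding_windows(seq, window, step):
    """Extract overlapping segments of shape (window, D) from a (T, D) sequence."""
    starts = range(0, seq.shape[0] - window + 1, step)
    return np.stack([seq[s:s + window] for s in starts])

def normalize_per_channel(seq, eps=1e-8):
    """Scale every sensor channel of a (T, D) sequence to [0, 1] (assumed range)."""
    lo, hi = seq.min(axis=0), seq.max(axis=0)
    return (seq - lo) / (hi - lo + eps)

rng = np.random.default_rng(0)
recording = rng.normal(size=(1000, 126))    # one OMoCap recording: T x 126 channels
windows = sliding_windows(normalize_per_channel(recording), window=100, step=12)
```

Each resulting segment is one feature-map input of size T × D for the tCNN.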
The architecture is trained using mini-batch gradient descent with the RMSProp update rule and a batch size of 400. Moreover, dropout was applied to the first and second fully-connected layers. In the case of predicting attributes for solving HAR, a nearest-neighbor (NN) approach was used for computing a class from the predicted attributes: the Euclidean distance is measured from the predicted attribute vector to each attribute representation in A. This is possible as each activity class is related to a certain number of binary attribute vectors in the attribute representation A, see Table 7. LARa provides the attribute representation A. Both the predicted attribute vector and the attribute representations A are normalized using the 2-norm. The tCNN is also trained using a softmax layer predicting activity classes directly; in this case, the architecture is trained using the cross-entropy loss. Table 9 and Table 10 present the performance of the method solving HAR on the LARa OMoCap dataset using the softmax layer and the attribute representation. Precision is computed as TP/(TP + FP) and recall as TP/(TP + FN), with TP, FP, and FN the true positives, false positives, and false negatives. The weighted F1 is calculated as wF1 = Σ_c (n_c/N) · F1_c, with n_c the number of window samples of class c and N the total number of windows. Handling and moving-Cart activities show the best performances. Using the attribute representation boosts the performance in comparison with the softmax classifier; in particular, the approach only classifies the Synchronization and Standing activities correctly when using attribute representations. In general, deploying an attribute representation boosts the performance of HAR. These results coincide with [13,16]: attributes belonging to frequent classes help with the classification of less frequent classes, and the effects of the unbalanced-class problem are reduced.
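The nearest-neighbor mapping from a predicted attribute vector to a class can be sketched with NumPy. The attribute matrix below is a toy stand-in for illustration; the real representation A ships with LARa:

```python
import numpy as np

def attribute_to_class(pred, A, labels):
    """Map a predicted attribute vector to the class of the nearest row of A.
    Both the prediction and the rows of A are L2-normalized, as in the text."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    p_n = pred / np.linalg.norm(pred)
    dists = np.linalg.norm(A_n - p_n, axis=1)     # Euclidean distance per representation
    return labels[int(np.argmin(dists))]

# Toy attribute representation: two representations for "Walking", one for "Standing".
A = np.array([[1, 0, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
labels = ["Walking", "Walking", "Standing"]
pred = np.array([0.9, 0.1, 0.8, 0.2])             # sigmoid pseudo-probabilities
print(attribute_to_class(pred, A, labels))
```

Because several rows of A may map to the same class, frequent classes lend their attribute vectors to the recognition of infrequent ones — the sharing effect noted in the text.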
Table 9

Recall and precision of human activity recognition (HAR) on the LARa OMoCap dataset.

Output     | Metric        | Stand. | Walk. | Cart  | Hand. (up.) | Hand. (cent.) | Hand. (down.) | Sync.
Softmax    | Recall [%]    | 3.11   | 71.96 | 71.34 | 61.39       | 87.40         | 65.30         | 0.0
Softmax    | Precision [%] | 73.00  | 45.29 | 81.35 | 57.10       | 70.85         | 80.72         | 0.0
Attributes | Recall [%]    | 55.86  | 54.31 | 76.12 | 69.16       | 80.99         | 74.36         | 69.84
Attributes | Precision [%] | 24.22  | 60.59 | 92.13 | 79.08       | 82.94         | 74.63         | 89.31
Table 10

The over-all accuracy and weighted F1 of HAR on the LARa OMoCap dataset.

Metric  | Softmax | Attributes
Acc [%] | 68.88   | 75.15
wF1 [%] | 64.43   | 73.62
Table 11 and Table 12 show the confusion matrices of the predictions using the tCNN in combination with the softmax layer and with the NN approach on the attribute representation, respectively. In general, the method exhibits difficulties predicting the class Standing; it mispredicts Standing sequence segments as Handling (centred) ones. The Walking class also shows some mispredictions. Following the results in Table 9 and Table 10, solving HAR using the attribute representation offers a better performance in comparison with the usage of a softmax layer. The classification of activity classes such as Synchronization, Standing, and Walking improves significantly.
Table 11

Confusion matrix from the class predictions using tCNN with the softmax layer.

Activities     | Stand. | Walk. | Cart  | Hand. (up.) | Hand. (cent.) | Hand. (down.) | Sync.
Stand.         | 311    | 1807  | 211   | 134         | 7446          | 86            | 0
Walk.          | 42     | 3776  | 461   | 35          | 918           | 15            | 0
Cart           | 1      | 928   | 9005  | 4           | 2684          | 1             | 0
Hand. (up.)    | 0      | 92    | 152   | 4188        | 2389          | 1             | 0
Hand. (cent.)  | 72     | 1681  | 1233  | 1717        | 36,587        | 572           | 0
Hand. (down.)  | 0      | 51    | 2     | 12          | 1437          | 2826          | 0
Sync.          | 0      | 2     | 6     | 1245        | 178           | 0             | 0
Table 12

Confusion matrix from the class predictions using the attribute predictions with tCNN and the nearest neighbor (NN) approach.

Activities     | Stand. | Walk. | Cart   | Hand. (up.) | Hand. (cent.) | Hand. (down.) | Sync.
Stand.         | 2421   | 1492  | 506    | 268         | 4668          | 219           | 421
Walk.          | 633    | 3179  | 691    | 47          | 649           | 41            | 7
Cart           | 44     | 298   | 11,630 | 20          | 629           | 2             | 0
Hand. (up.)    | 71     | 44    | 32     | 5395        | 1218          | 20            | 42
Hand. (cent.)  | 1085   | 825   | 2403   | 1919        | 34,719        | 832           | 79
Hand. (down.)  | 73     | 15    | 16     | 9           | 982           | 3230          | 3
Sync.          | 7      | 0     | 0      | 143         | 3             | 0             | 1278
Table 13 presents the performance on the attributes. In general, the attributes are correctly classified. The attributes none and error are not present in the test dataset; however, they are also not misclassified. The attribute Torso Rotation is likewise not mispredicted; nevertheless, the precision and recall of this attribute are zero, which suggests that it is never predicted when it should be. An improvement in this particular attribute is needed.
Table 13

The accuracy, precision, and recall [%] for the attributes on the test dataset.

| Attribute | Accuracy | Precision | Recall |
|-----------|----------|-----------|--------|
| a1        | 89.3     | 79.0      | 82.1   |
| a2        | 76.9     | 82.8      | 70.3   |
| a3        | 84.5     | 83.4      | 92.0   |
| a4        | 93.9     | 80.4      | 73.1   |
| a5        | 81.7     | 85.6      | 83.9   |
| a6        | 96.4     | 76.7      | 68.7   |
| a7        | 82.5     | 86.3      | 72.0   |
| a8        | 96.9     | 0.0       | 0.0    |
| a9        | 92.0     | 92.8      | 98.5   |
| a10       | 79.1     | 81.6      | 92.4   |
| a11       | 90.3     | 91.9      | 36.2   |
| a12       | 76.2     | 48.7      | 37.2   |
| a13       | 71.7     | 60.4      | 63.0   |
| a14       | 85.2     | 74.3      | 26.8   |
| a15       | 91.3     | 88.8      | 74.2   |
| a16       | 98.3     | 98.8      | 49.5   |
| a17       | 90.2     | 95.4      | 41.0   |
| a18       | 100      | 0.0       | 0.0    |
| a19       | 100      | 0.0       | 0.0    |
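The per-attribute metrics in Table 13 are computed independently for each binary attribute. The sketch below uses hypothetical windows with three attributes and reproduces the Torso Rotation effect: accuracy can stay high while precision and recall drop to zero when an attribute that does occur is never predicted:

```python
def attribute_metrics(y_true, y_pred):
    # y_true, y_pred: lists of equal-length binary attribute vectors.
    n_attrs = len(y_true[0])
    metrics = []
    for j in range(n_attrs):
        t = [v[j] for v in y_true]
        p = [v[j] for v in y_pred]
        tp = sum(a == 1 and b == 1 for a, b in zip(t, p))
        fp = sum(a == 0 and b == 1 for a, b in zip(t, p))
        fn = sum(a == 1 and b == 0 for a, b in zip(t, p))
        acc = sum(a == b for a, b in zip(t, p)) / len(t)
        prec = tp / (tp + fp) if tp + fp else 0.0  # 0.0 if never predicted
        rec = tp / (tp + fn) if tp + fn else 0.0   # 0.0 if never detected
        metrics.append((acc, prec, rec))
    return metrics

# Hypothetical windows; attribute 3 occurs once but is never predicted,
# so its accuracy is 0.75 while precision and recall are both 0.0.
y_true = [(1, 1, 0), (0, 1, 1), (1, 0, 0), (0, 1, 0)]
y_pred = [(1, 1, 0), (0, 1, 0), (1, 1, 0), (0, 1, 0)]
for acc, prec, rec in attribute_metrics(y_true, y_pred):
    print(f"acc={acc:.2f} prec={prec:.2f} rec={rec:.2f}")
```

This is why the text recommends inspecting precision and recall, not only accuracy, when judging rare attributes.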
Both trained tCNNs (with a softmax and a sigmoid layer, respectively) and the attribute representation A are included in the annotation and revision tool. The implementation code of the annotation tool is also available in [39]. These results provide a first evaluation of the dataset for solving HAR.

5. Discussion and Conclusions

This contribution presents LARa, the first freely accessible dataset for the sensor-based recognition of human activities in logistics using semantic attributes. Guidelines for creating a dataset were developed based on an analysis of related datasets. Following these guidelines, 758 min of picking and packing activities of 14 subjects were recorded, annotated, and revised. The dataset contains OMoCap data, IMU data, RGB videos, the recording protocol, and the tool for annotation and revision. Multichannel time-series HAR was solved for LARa using temporal convolutional neural networks (tCNNs). Classification performance is consistent with the state of the art for tCNNs. Semantic descriptions, or attributes, of human activities improve classification performance; this supports the effort of annotating attributes and the conclusions from [16].

From an application perspective, several directions for fundamental research as well as industrial application arise from the LARa dataset. The laboratory dataset LARa will be deployed on IMU data recorded in an industrial environment. The addition of more subjects and the inclusion of further logistical processes and objects is conceivable, and new attributes may be added. Another approach to recognizing human activity is to exploit context: context may provide information about locations and articles and broaden the application spectrum of the dataset, and context information about the process is provided in this contribution. Dependencies between the activities have to be examined, e.g., using state machines. Can information about dependencies increase the accuracy of the recognition of human activities in logistics? Finally, the industrial applicability must be proven through a comparison between sensor-based HAR and manual time-management methods, such as REFA and MTM. Can manual time-management methods be enhanced using HAR and LARa?
Further experiments concerning the relation between the activity classes and the attributes will be relevant to evaluate. Analyzing the architecture's filters and their activations with respect to the attribute representations will be useful for understanding how deep architectures process input signals. The LARa dataset can also be used for solving retrieval problems in HAR; retrieval tasks might help facilitate data annotation. Additionally, data-stream approaches are worth addressing with this dataset, as is a comparison of HAR methods based on statistical pattern recognition and on deep architectures. The extensive LARa dataset will be of use for investigating RNNs. Computational causal behavior models are of interest for incorporating the flow charts of the scenarios and longer temporal relations of the input signals.
Appendix A

Table A1

Content criteria for the filtering process.

| Stage | Content Criteria  | Description |
|-------|-------------------|-------------|
| I     | Human             | Data must relate to human movements. |
| II    | Sensor            | Dataset must contain IMU or OMoCap data, or both. |
| III   | Access            | Dataset must be accessible online, downloadable, and free of charge. |
| IV    | Physical Activity | Caspersen et al. [44] defined physical activity "as any bodily movement produced by skeletal muscles that results in energy expenditure". The definition of physical activity is limited here to torso and limb movement [10]. |
Table A2

Examined datasets per stage.

| Stage | Content Criteria  | No. of Datasets |
|-------|-------------------|-----------------|
| I     | Human             | 173             |
| II    | Sensor            | 95              |
| III   | Access            | 70              |
| IV    | Physical Activity | 61              |
Table A3

Categorization scheme.

| Root Category / Subcategory | Description |
|-----------------------------|-------------|
| General Information         | |
| – Year                      | Year of publication. Updates are not taken into account. |
| – Dataset Name              | Name of the dataset and its acronym |
| – Ref. [Dataset]            | Link and, if available, the identifier (DOI) of the dataset |
| – Ref. [Paper]              | Identifier or, if not available, link of the paper that describes the dataset, uses it, or is generally given as a reference |
| Domain of the Act. Class    | |
| – Work                      | Office work, general work, and physical work in production and logistics |
| – Exercises                 | Sport activity classes, e.g., basketball, yoga, boxing, golf, ice hockey, soccer |
| – Locomotion                | e.g., walking, running, elevating, sitting down, going upstairs and downstairs |
| – ADL                       | Activity classes of daily living, e.g., watching TV, shopping, cooking, eating, cleaning, dressing, driving a car, personal grooming, interacting, talking, lying |
| – Fall Detection            | Falling in different directions and from different heights |
| – Hand Gestures             | Focus on the movement of hands, e.g., arm swiping, hand waving, and clapping |
| – Dance                     | e.g., jazz dance, hip-hop dance, salsa, tango |
| Data Specification          | |
| – Recording Time [min]      | Total time of the recordings in minutes |
| – Data Size [MiB]           | Size of the entire unzipped dataset in mebibytes, including, e.g., RGB videos and pictures |
| – Format                    | Formats of the data published in the repository |
| – No. Subjects              | Number of unique subjects |
| – No. Act. Classes          | Number of individual activity classes |
| – List Act. Classes         | List of all individual activity classes |
| – Laboratory                | The recordings were made in a laboratory environment |
| – Real Life                 | The recordings were made in a real environment, e.g., outdoors, on a sports field, or in a production facility |
| Sensor                      | |
| – OMoCap [Hz]               | Optical marker-based motion capture, with frames per second or hertz as the unit |
| – IMU [Hz]                  | Inertial measurement unit, with hertz as the unit |
| – Other Sensors             | Sensors other than IMU and OMoCap |
| – Phone, Watch, Glasses     | Use of sensors built into a smartphone, smartwatch, or smart glasses |
Table A4

Overview of related publicly available human activity recognition datasets. The entries are sorted chronologically in ascending order by year of publication and alphabetically by dataset name. Missing information is marked with “-”.

General InformationDomain of the Act. classData SpecificationSensorAttachment (Sensor/Marker)
YearDataset NameRef. [Dataset]Ref. [Paper]WorkExercisesLocomotionADLFall DetectionHand GesturesDanceRecording Time [min]Data Size [MiB]FormatNo SubjectsNo Act. ClassesLaboratoryReal LifeOMoCap [fps/Hz]IMU [Hz]Other SensorsPhone, Watch, GlassesHand/WristLower ArmUpper ArmFoot/AnkleLower LegUpper LegHipShoulderBelly/WaistThorax/ChestLower BackUpper BackHead
2003Carnegie Mellon University Motion Capture Database (CMU Mocap)[84]- xxx x-18,673amc11223x 120 xxxxxxxxxxxxx
2004Leuven Action Database[45][85] xxx -14text, avi, xls, pdf122x 30 RGB xxxx xxx x xx
2007HDM05[54][86] xxx x-3000.32c3d, amc, avi570x 120 RGB xxxxxxxx xxxx
2008Wearable Action Recognition Database (WARD)[87][88] xx -41.66mat2013 x 20 xx xxx
2009BodyAttack Fitness[83][89] x 155.64mat16x 64 xx
2009Carnegie Mellon University Multimodal Activity (CMU-MMAC) Database[59][90] x -60,897.03amc, txt, asf, wav, xls, avi4329x 120125RGB, microphone, RFID, BodyMediaxxxxxxxxx xxxx
2009HCI gestures[83][89] x -12.9mat15x 96 xx
2009HumanEva I[91][92] xx -13,824-46x 120RGB, depth xxxxxxxx x xx
2009HumanEva II[93][92] x -4,649-24x 120RGB xxxxxxxx x xx
2010KIT Whole-Body Human Motion Database[48][65] xxx xx-2,097,152xml, c3d, avi22443x 100 RGB xxxxxxxx xxxx
2010Localization Data for Person Activity Data Set[55][94] xxx -20.5txt511x 10 x x x
20113DLife/Huawei ACM MM Grand Challenge 2011[95][96] x--svl, cvs155x 160RGB, microphone, depth x x x
2011UCF-iPhone Data Set[97][98] xx -13.1csv99 x 60 x x
2011Vicon Physical Action Data Set[46][99] xx 33.33144txt1020x 200 xx xx x
2012Activity Prediction (WISDM)[100][101] x -49.1txt296x 20 x x
2012Human Activity Recognition Using Smartphones Data Set (UCI HAR)[102][103] xx 192269txt306x 50 x x
2012OPPORTUNITY Activity Recognition Data Set[4][104] xx x 1500859txt1224 x 32 xxxxxxx xx
2012PAMAP2 Physical Activity Monitoring Data Set[5][105] xxx 6001652.47txt918xx 100heart rate monitor x x x
2012USC-SIPI Human Activity Dataset[76][106] x -42.7mat1412 x 100 x
2013Actitracker (WISDM)[107][108] xx -2588.92txt296 x 20 x x
2013Daily and Sports Activities Data Set[109][110] xxx 760402csv819 x 25 x x x
2013Daphnet Freezing of Gait Data Set[69][111] x 50086.2txt103x 64RGB xx x
2013Hand Gesture[74][1] x x 7047.6mat211 x 32 xxx
2013Physical Activity Recognition Dataset Using Smartphone Sensors[47][112] x -63.1xlsx46 x 50 x x xx x
2013Teruel-Fall (tFall)[113][114] x -65.5dat108 x 50 x x
2013Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements (PUC-Rio)[56][115] x 48013.6dat45x 10 x xx x
2014Activity Recognition from Single Chest-Mounted Accelerometer Data Set[116][117] xx 43144.2csv157 x 52 x
2014Realistic sensor displacement benchmark dataset (REALDISP)[118][119] xx 566.026717.43txt1733x 50 xx xx x
2014Sensors activity dataset[47][120] x 2800308csv108 x 50 x xx xx
2014User Identification From Walking Activity Data Set[121][117] xx 4314.18csv225 x 52RGB, microphonex x
2015Complex Human Activities Dataset[47][122] xx 390240csv1013 x 50 xx x
2015Heterogeneity Activity Recognition Data Set (HHAR)[60][123] x 2703333.73csv96 x 200 x x x
2015Human Activity Recognition with Inertial Sensors[57][124] xxx 496324mat1913x 10 x xx
2015HuMoD Database[125][126] xx 49.46044.27mat28x 500 EMG xxxxx xxxx
2015Project Gravity[61][127] xxx -27.6json319x 25RGBxx x
2015Skoda Mini Checkpoint[83][128]x x 18080.3mat110 x 98 xxx
2015Smartphone-Based Recognition of Human Activities and Postural Transitions Data Set (SBHAR)[129][130] xx 300240txt3012 x 50RGBx x
2015UTD Multimodal Human Action Dataset (UTD-MHAD)[49][131] xx x -1316.15mat, avi827x 50depth x x
2016Activity Recognition system based on Multisensor data fusion (AReM) Data Set[132][133] xx 1761.69csv16x 70 x x x x
2016Daily Log[51][134] xxx 106,5604815.97csv733 x xGPSx x x
2016ExtraSensory Dataset[52][135]xxxx 308,320144,423.88dat, csv, mfcc6051 x 40microphonexx x
2016HDM12 Dance[136][136] x972,175.48asf, c3d2220x 128 xxxxxxxx x xx
2016RealWorld[75][67] xx 10653891.92csv158 x 50GPS, magnetic field, microphone, RGB, lightx xx xx xx x
2016Smartphone Dataset for Human Activity Recognition in Ambient Assisted Living[137][103] xx 94.7946.5txt306 x 50 x x
2016UMAFall: Fall Detection Dataset[138][139] xxx -359csv1914 x 200 xx x x xx
2017An Open Dataset for Human Activity Analysis using Smart Devices[62][140] xxx -433csv116 x x xx x x
2017IMU Dataset for Motion and Device Mode Classification[141][50] x -2835.21mat83 x 100 xxxxxxxxx xx
2017Martial Arts, Dancing and Sports (MADS) Dataset[71][142] x x-24,234.96mov, zip55x 60 RGB xxxxxxxx xxxx
2017Physical Rehabilitation Movements Data Set (UI-PRMD)[66][143] xx -4700.17txt1010x 100 depth xxxxxxxx x xx
2017SisFall[144][145] xxx 1849.331627.67txt3834 x 200 x
2017TotalCapture Dataset[72][146] xx ---55x x60RGB xxxxxxxx xxxx
2017UniMiB SHAR[147][148] xxx -255mat3017x 50microphonex x
2018Fall-UP Dataset (Human Activity Recognition)[149][150] xxx 165.0078csv1711x 100infrared, RGB x x xxx xx
2018First-Person View[63]- xx -1046.86mp4, csv27 x xRGBx x x x
2018HAD-AW[64][151]xxxx x-325xlsx1631 x 50 xx
2018HuGaDB[152][153] x 600401txt1812 x xEMG xxx
2018Oxford Inertial Odometry Dataset (OxIOD)[53][154] x 883.22751.73csv42xx250100 xx x
2018Simulated Falls and Daily Living Activities Data Set[155][156] xxx 6303972.06txt1736x 25 x xx xx x
2018UMONS-TAICHI[157][158] x -28,242.47txt, c3d, tsv1213x 179 RGB, depth xxxxxxxx x xx
2019AndyData-lab-onePerson[70][32]x xx 30099,803.46mvn, mvnx, c3d, bvh, csv, qtm, mp4136x 120240RGB, pressure sensor handglove xxxxxxxx xxxx
2019PPG-DaLiA[58][159]x xx 2,19023,016.74pkl, csv158 x 700PPG, ECG x x
Sum 61 5 20 51 35 9 6 7 33 30 25 29 30 24 20 28 36 31 16 10 25 11 18 21
Min. 15 1.69 1 2 30 10
Avg. 13,531.08 43,605.15 21.1 14.8 155.9 86.2
Max. 308,320 2,097,152 224 70 500 700
2019 LogisticActivityRecognitionChallenge (LARa) [160] x x 758 58,907.15 csv, mp4, pdf, pt, py 14 8 x 200 100 RGB x x x x x x x x x x x x x
References

1.  Perception of biological motion: a stimulus set of human point-light actions.

Authors:  Jan Vanrie; Karl Verfaillie
Journal:  Behav Res Methods Instrum Comput       Date:  2004-11

2.  Physical activity, exercise, and physical fitness: definitions and distinctions for health-related research.

Authors:  C J Caspersen; K E Powell; G M Christenson
Journal:  Public Health Rep       Date:  1985 Mar-Apr       Impact factor: 2.792

3.  SisFall: A Fall and Movement Dataset.

Authors:  Angela Sucerquia; José David López; Jesús Francisco Vargas-Bonilla
Journal:  Sensors (Basel)       Date:  2017-01-20       Impact factor: 3.576

4.  UP-Fall Detection Dataset: A Multimodal Approach.

Authors:  Lourdes Martínez-Villaseñor; Hiram Ponce; Jorge Brieva; Ernesto Moya-Albor; José Núñez-Martínez; Carlos Peñafort-Asturiano
Journal:  Sensors (Basel)       Date:  2019-04-28       Impact factor: 3.576

5.  Deep PPG: Large-Scale Heart Rate Estimation with Convolutional Neural Networks.

Authors:  Attila Reiss; Ina Indlekofer; Philip Schmidt; Kristof Van Laerhoven
Journal:  Sensors (Basel)       Date:  2019-07-12       Impact factor: 3.576

6.  Fusion of smartphone motion sensors for physical activity recognition.

Authors:  Muhammad Shoaib; Stephan Bosch; Ozlem Durmaz Incel; Hans Scholten; Paul J M Havinga
Journal:  Sensors (Basel)       Date:  2014-06-10       Impact factor: 3.576

7.  Detecting falls with wearable sensors using machine learning techniques.

Authors:  Ahmet Turan Özdemir; Billur Barshan
Journal:  Sensors (Basel)       Date:  2014-06-18       Impact factor: 3.576

8.  Complex Human Activity Recognition Using Smartphone and Wrist-Worn Motion Sensors.

Authors:  Muhammad Shoaib; Stephan Bosch; Ozlem Durmaz Incel; Hans Scholten; Paul J M Havinga
Journal:  Sensors (Basel)       Date:  2016-03-24       Impact factor: 3.576

9.  Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition.

Authors:  Francisco Javier Ordóñez; Daniel Roggen
Journal:  Sensors (Basel)       Date:  2016-01-18       Impact factor: 3.576

10.  UMONS-TAICHI: A multimodal motion capture dataset of expertise in Taijiquan gestures.

Authors:  Mickaël Tits; Sohaïb Laraba; Eric Caulier; Joëlle Tilmanne; Thierry Dutoit
Journal:  Data Brief       Date:  2018-05-23