Cristian Kaori Valencia-Marin, Juan Diego Pulgarin-Giraldo, Luisa Fernanda Velasquez-Martinez, Andres Marino Alvarez-Meza, German Castellanos-Dominguez.
Abstract
Motion capture (Mocap) data are widely used as time series to study human movement. Indeed, animation movies, video games, and biomechanical systems for rehabilitation are significant applications related to Mocap data. However, classifying multi-channel time series from Mocap requires coding the intrinsic dependencies (even nonlinear relationships) between human body joints. Furthermore, the same human action may vary because individuals alter their movements, which increases inter/intra-class variability. Here, we introduce an enhanced Hilbert embedding-based approach from a cross-covariance operator, termed EHECCO, to map the input Mocap time series to a tensor space built from both 3D skeletal joints and a principal component analysis-based projection. The obtained results demonstrate how EHECCO represents and discriminates joint probability distributions as kernel-based evaluations of input time series within a tensor reproducing kernel Hilbert space (RKHS). Our approach achieves competitive classification results for style/subject and action recognition tasks on well-known publicly available databases. Moreover, EHECCO favors the interpretation of relevant anthropometric variables correlated with players' expertise and acted movement on a Tennis-Mocap database (also made publicly available with this work). Thereby, our EHECCO-based framework provides a unified representation (through the tensor RKHS) of the Mocap time series to compute linear correlations between a coded metric from joint distributions and player properties, i.e., age, body measurements, and sport movement (action class).
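The abstract's core idea is to compare joint probability distributions of time series through a cross-covariance operator in an RKHS. A standard empirical estimator of the Hilbert-Schmidt norm of that operator is HSIC; the sketch below is illustrative only and is not the paper's exact EHECCO metric (the kernel choice, bandwidth `gamma`, and toy data are assumptions):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise RBF kernel matrix for the rows of X.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def hsic(X, Y, gamma=1.0):
    # Empirical Hilbert-Schmidt norm of the cross-covariance operator
    # between the samples X and Y (standard biased HSIC estimator).
    n = X.shape[0]
    K, L = rbf_kernel(X, gamma), rbf_kernel(Y, gamma)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # e.g., 3D joint coordinates
Y = X @ rng.normal(size=(3, 2))          # dependent projection (e.g., latent scores)
Z = rng.normal(size=(100, 2))            # independent series
print(hsic(X, Y), hsic(X, Z))            # the dependent pair should score higher
```

Large HSIC values signal statistical dependence between the two series, which is what lets a cross-covariance-based metric discriminate joint distributions.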
Keywords: Hilbert embedding; Mocap data; classification; joint distribution; time series
Year: 2021 PMID: 34209582 PMCID: PMC8271882 DOI: 10.3390/s21134443
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Schematic illustration of our EHECCO-based metric. The input spaces are mapped to their respective RKHSs; then, the tensor space is built using a cross-covariance operator strategy.
Figure 2. EHECCO-based Mocap data classification framework. Hip joint normalization and spectral clustering-based codebook generation are carried out to extract relevant skeletal poses. Then, the 3D joint representation and the PCA-based latent projection are used to support the EHECCO metric from the joint probability. Lastly, an SVM classifier is trained from the EHECCO distance, which also supports 2D data visualization.
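The Figure 2 pipeline (hip normalization → pose descriptors → pairwise distance → SVM on a precomputed kernel) can be sketched as follows. The record shapes, the hip being joint 0, the PCA-singular-value descriptor, and the `exp(-D)` distance-substitution kernel are all illustrative assumptions standing in for the paper's EHECCO distance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Hypothetical layout: each Mocap record is (frames, joints, 3); joint 0 is the hip.
def hip_normalize(record):
    # Center every frame on the hip joint so poses are translation-invariant.
    return record - record[:, :1, :]

def features(record, n_components=2):
    # Per-record PCA on flattened poses; the leading singular values act as a
    # crude latent-space descriptor (a stand-in for the paper's representation).
    flat = hip_normalize(record).reshape(len(record), -1)
    return PCA(n_components=n_components).fit(flat).singular_values_

rng = np.random.default_rng(1)
records = [rng.normal(size=(50, 20, 3)) for _ in range(20)]
labels = np.array([0] * 10 + [1] * 10)   # two toy action classes

X = np.stack([features(r) for r in records])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # stand-in pairwise distance
K = np.exp(-D)                                        # distance-substitution kernel
clf = SVC(kernel="precomputed").fit(K, labels)
print(clf.predict(K).shape)                           # one prediction per record
```

Training on a precomputed kernel is the key design point: once any distance between records is available (here Euclidean, in the paper EHECCO), the same SVM machinery applies unchanged.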
Tennis dataset’s anthropometric measurements. The color represents the measurement group: age (brown), weight (light green), length (red), perimeters (blue), fat fold (pink), and tennis move (black).
| Tennis move |
|---|
| Forehand (FORE) |
| Smash (SMA) |
| Backhand (BAC) |
| Serve (SER) |
| Volley (VOL) |
| Backhand Volley (BAV) |
Figure 3. Illustrative results for codebook generation and latent space-based representation (HDM05 and CMU subset datasets). Top: codebook generation for a Mocap video of the throwing high with the right hand while standing class (HDM05). Middle: codebook generation for a Mocap record of the boxing class (CMU subset). Bottom left: PCA-based latent space for the HDM05 video. Bottom right: PCA-based latent space for the CMU subset video. The first two components are shown for visualization purposes. Black markers represent the original input Mocap frames (time series); color markers represent the chosen frames (codebook).
Figure 4. EHECCO-based classification results for the HDM05 and CMU subset databases. Top left: HDM05’s confusion matrix (style/subject recognition). Top right: HDM05 t-SNE-based 2D projection from the EHECCO distance. Bottom left: CMU subset’s confusion matrix (action recognition). Bottom right: CMU subset t-SNE-based 2D projection from the EHECCO distance.
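The 2D projections in Figure 4 are obtained by feeding a pairwise distance matrix to t-SNE rather than raw features. A minimal sketch, assuming a toy symmetric distance matrix in place of the EHECCO distances (the sample size and perplexity are arbitrary):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy symmetric distance matrix standing in for pairwise EHECCO distances.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)

# t-SNE accepts a precomputed distance matrix directly; with
# metric="precomputed", scikit-learn requires init="random".
emb = TSNE(n_components=2, metric="precomputed", init="random",
           perplexity=10, random_state=0).fit_transform(D)
print(emb.shape)  # (30, 2)
```

Because only distances are needed, the same visualization works for any metric defined between whole Mocap records.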
Comparison of Mocap-based style/subject recognition results (HDM05 dataset). Average accuracy is reported for the cited works and our approach (EHECCO+SVM).
| Method | Accuracy (%) |
|---|---|
| SPDNet [ | 61.45 |
| SE [ | 70.26 |
| SO [ | 71.31 |
| LieNet [ | 75.78 |
| Seq2Im+SVM [ | 70.70 |
| Seq2Im+KNN [ | 66.82 |
| Seq2IM+RF [ | 80.62 |
| Seq2Im+CNN (fine-tuning) [ | 83.33 |
| EHECCO+SVM | 88.80 |
Comparison of Mocap-based action recognition results (CMU subset database). Average accuracy is reported for the cited works and our approach (EHECCO+SVM).
| Method | Accuracy (%) |
|---|---|
| MT+DTW [ | 82.9 |
| SSM+DTW [ | 85.3 |
| EMR [ | 86.7 |
| MW+CNN [ | 90.7 |
| EHECCO+SVM | 90.0 |
Figure 5. Illustrative results for codebook generation (Tennis-Mocap dataset). Top: forehand; middle: volley; bottom: smash.
Figure 6. EHECCO-based classification and anthropometric measurement results for the Tennis-Mocap database. Top left: confusion matrix (action recognition). Top right: t-SNE-based 2D projection from the EHECCO distance. Bottom left: absolute value of Pearson’s correlation coefficient between the mean of the first t-SNE-based projection (from the EHECCO distance) of each player’s videos and their anthropometric measurements. The most relevant correlations are shown.
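The correlation analysis in Figure 6 reduces to a Pearson coefficient between a per-player embedding coordinate and each measurement. A minimal sketch with synthetic data (the player count, the measurement, and the linear relationship are assumptions for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: one mean t-SNE coordinate per player and one measurement
# (e.g., a body length in cm) constructed to correlate with it.
rng = np.random.default_rng(3)
proj = rng.normal(size=15)                      # mean first t-SNE coordinate
height = 170 + 5 * proj + rng.normal(size=15)   # correlated measurement

r, p = pearsonr(proj, height)
print(abs(r), p)
```

Reporting |r| (as in the figure) treats positive and negative linear associations symmetrically, since the sign of a t-SNE axis is arbitrary.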