
VI-Net-View-Invariant Quality of Human Movement Assessment.

Faegheh Sardari¹, Adeline Paiement², Sion Hannuna¹, Majid Mirmehdi¹.

Abstract

We propose a view-invariant method towards the assessment of the quality of human movements which does not rely on skeleton data. Our end-to-end convolutional neural network consists of two stages: first, a view-invariant trajectory descriptor for each body joint is generated from RGB images, and then the collection of trajectories for all joints is processed by an adapted, pre-trained 2D convolutional neural network (CNN) (e.g., VGG-19 or ResNeXt-50) to learn the relationship amongst the different body parts and deliver a score for the movement quality. We release the only publicly-available, multi-view, non-skeleton, non-mocap, rehabilitation movement dataset (QMAR), and provide results for both cross-subject and cross-view scenarios on this dataset. We show that VI-Net achieves an average rank correlation of 0.66 on cross-subject and 0.65 on unseen views when trained on only two views. We also evaluate the proposed method on the single-view rehabilitation dataset KIMORE and obtain 0.66 rank correlation against a baseline of 0.62.


Keywords: health monitoring; movement analysis; view-invariant convolutional neural network (CNN)

Year: 2020 PMID: 32942561 PMCID: PMC7570706 DOI: 10.3390/s20185258

Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576

1. Introduction

Beyond the realms of action detection and recognition, action analysis includes the automatic assessment of the quality of human action or movement, for example, in sports action analysis [1,2,3,4], skill assessment [5,6], and patient rehabilitation movement analysis [7,8]. For example, in the latter application, clinicians observe patients performing specific actions in the clinic, such as walking or sitting-to-standing, to establish an objective marker for their level of functional mobility. By automating such mobility disorder assessment using computer vision, health service authorities can decrease costs, reduce hospital visits, and diminish the variability in clinicians’ subjective assessment of patients. Recent RGB (red, green, blue) based action analysis methods, such as References [2,3,4,6], are not able to deal with view-invariance when applied to viewpoints significantly different to their training data. To achieve some degree of invariance, some works such as References [7,8,9,10,11,12,13], have made use of 3D human pose obtained from (i) Kinect, (ii) motion capture, or (iii) 3D pose estimation methods. Although the Kinect can provide 3D pose efficiently in optimal conditions, it is dependent on several parameters, including distance and viewing direction between the subject and the sensor. Motion capture systems (mocaps) tend to be highly accurate and view-invariant, but obtaining 3D pose by such means is expensive and time consuming, since it requires specialist hardware, software, and setups. These make mocaps unsuitable for use in unconstrained home or clinical or sports settings. Recently, many deep learning methods, for example, References [14,15,16,17,18], have been proposed to extract 3D human pose from RGB images. 
Such methods either (a) do not deal with view-invariance and are trained from specific views on their respective datasets (for example, References [14,17] show that their methods fail when applied to poses and view angles that differ from their training sets), or (b) handle view-invariance, such as References [19,20], but need multiple views for training. To the best of our knowledge, there is no existing RGB-based, view-invariant method that assesses the quality of human movement. We argue here that temporal pose information from RGB can be repurposed, instead of skeleton points, for view-invariant movement quality assessment. In the proposed end-to-end View-Invariant Network (VI-Net in Figure 1), we stack temporal heatmaps of each body joint (obtained from OpenPose [21]) and feed them into our view-invariant trajectory descriptor module (VTDM). This applies a 2D convolution layer that aggregates spatial poses over time to generate a trajectory descriptor map per body joint, which is then forged to be view-invariant by deploying the Spatial Transformer Network [22]. Next, in our movement score module (MSM), these descriptor maps for all body joints are put through an adapted pre-trained 2D convolution model, such as VGG-19 [23] or ResNeXt-50 [24], to learn the relationship amongst the joint trajectories and estimate a score for the movement. Note that OpenPose has been trained on 2D pose datasets, which means that our proposed method implicitly benefits from joint labelling.

Figure 1

VI-Net has a view-invariant trajectory descriptor module (VTDM) and a movement score module (MSM), where the classifier output corresponds to a quality score.

Initially, we apply our method to a new dataset, called QMAR (dataset and code can be found at https://github.com/fsardari/VI-Net), that includes multiple camera views of subjects performing both normal movements and simulated Parkinsons and Stroke ailments for walking and sit-to-stand actions. We provide cross-subject and cross-view results on this new dataset. Recent works such as References [25,26,27,28], provide cross-view results only when their network is trained on multiple views. As recently noted by Varol et al. [29], a highly challenging scenario in view-invariant action recognition would be to obtain cross-view results by training from only one viewpoint. While we present results using a prudent set of two viewpoints only within a multi-view training scenario, we also rise to the challenge to provide cross-view results by training solely from a single viewpoint. We also present results on the single-view rehabilitation dataset KIMORE [30] which provides 5 different types of lower back exercises in real patients suffering from Parkinsons, Stroke, and back pain. This work makes a number of contributions. We propose the first view-invariant method to assess quality of movement from RGB images and our approach does not require any knowledge about viewpoints or cameras during training or testing. Further, it is based on 2D convolutions only which is computationally cheaper than 3D temporal methods. We also present an RGB, multi-view, rehabilitation movement assessment dataset (QMAR) to both evaluate the performance of the proposed method and provide a benchmark dataset for future view-invariant methods. The rest of the paper is organized as follows. We review the related works in Section 2 and our QMAR dataset in Section 3. Then, we present our proposed network in Section 4 and experimental results in Section 5. Finally, in Section 6, we conclude our work, discuss some of its limitations, and provide directions for future research.

2. Related Work

Action analysis has gathered pace only in recent years, with the majority of works covering one of physical rehabilitation, sport scoring, or skill assessment [13]. Here, we first consider example non-skeleton based methods (which mainly address sport scoring), and then review physical rehabilitation methods, as this is the main application focus of our work. Finally, given the lack of existing view-invariant movement analysis techniques, we briefly reflect on related view-invariant action recognition approaches.

Non-Skeleton Movement Analysis—A number of works have focused on scoring sports actions. Pirsiavash et al. [31] proposed a support vector machine (SVM) based method, trained on spatio-temporal features of body poses, to assess the quality of diving and figure-skating actions. Although their method estimated action scores better than human non-experts, it was less accurate than human expert judgments. More recently, deep learning methods have been deployed to assess the quality of sport actions in RGB-only data, such as References [1,2,3,4,32,33]. For example, Li et al. [1] divided a video into several clips to extract their spatio-temporal features by differently weighted C3D [34] networks and then concatenated the features for input to another C3D network to predict action scores. Parmar and Morris presented a new dataset and also used a C3D network to extract features for multi-task learning [3]. The authors of References [4,33] proposed I3D [35] based methods to analyse human movement. Pan et al. [4] combined I3D features with pose information by building joint relation graphs to predict movement scores. Tang et al. [33] proposed a novel loss function which addresses the intrinsic score distribution uncertainty of sport actions in the decisions by different judges. The use of 3D convolutions imposes a hefty memory and computational burden, even for a relatively shallow model, which we avoid in our proposed method. 
Furthermore, the performance of these methods is expected to drop significantly when they are applied to a different viewpoint, since they are trained on appearance features which change drastically with viewpoint.

Rehabilitation Movement Assessment—Several works have focussed on such movement assessment, for example, References [7,8,9,10,36,37]. Crabbe et al. [9] proposed a CNN to map a depth image to a high-level pose in a manifold space made from skeleton data. The high-level poses were then employed by a statistical model to assess quality of movement for walking on stairs. In Reference [7], Sardari et al. extended the work in Reference [9] by proposing a ResNet-based model to estimate view-invariant high-level pose from RGB images, where the high-level pose representation was derived from 3D mocap data using manifold learning. The accuracy of their proposed method was good when training was performed from all views, but dropped significantly on unseen views. Liao et al. [8] proposed a long short term memory (LSTM) based method for rehabilitation movement assessment from 3D mocap skeleton data and proposed a performance metric based on Gaussian mixture models to estimate their score. Elkholy et al. [37] extracted spatio-temporal descriptors from 3D Kinect skeleton data to assess the quality of movement for walking on stairs, sit-down, stand-up, and walking actions. They first classified each sequence as normal or abnormal by building a probabilistic model from descriptors derived from normal subjects, and then scored an action by fitting a linear regression on spatio-temporal descriptors of movements with different scores. Khokhlova et al. [10] proposed an LSTM-based method to classify pathological gaits from Kinect skeleton data. They trained several bi-directional LSTMs on different training/validation sets of data. For classification, they computed the weighted mean of the LSTM outputs. 
All the methods that rely on skeleton data are either unworkable or difficult to apply to in-the-wild scenarios for rehabilitation (or sports or skills) movement analysis.

View-Invariant Action Recognition—As stated in References [26,29,38] amongst others, the performance of action recognition methods, such as References [34,35,39,40,41] to name a few, drops drastically when the models are tested on unseen views, since appearance features change significantly in different viewpoints. To overcome this, some works have dealt with viewpoint variations through skeleton data, for example, References [38,42,43,44]. Rahmani et al. [38] train an SVM on view-invariant feature vectors from dense trajectories of multiple views in mocap data via a fully connected neural network. Zhang et al. [44] developed a two-stream method, one LSTM and one convolutional model, where both streams include a view adaptation and a classification network. In each case, the former network was trained to estimate the transformation parameters of 3D skeleton data to a canonical view, and the latter classified the action. Finally, the two streams were fused by weighted averaging of the two classifiers’ outputs. As providing skeleton data is difficult for in-the-wild scenarios, others such as References [25,26,27,29,45] have focused on generating view-invariant features from RGB-D data. Li et al. [26] extract unsupervised view-invariant features by designing a recurrent encoder network which estimates 3D flows from RGB-D streams of two different views. In Reference [29], the authors generated synthetic multi-view video sequences from one view, and then trained a 3D ResNet-50 [40] on both synthetic and real data to classify actions. Among these methods, Varol et al. 
[29] is the only work that provides cross-view evaluation through single-view training, resulting in accuracy on the UESTC dataset [46], which then was increased to when they used additional synthetic multi-view data for training.

3. Datasets

There are many datasets for healthcare applications, such as References [8,30,37,47,48], which are single-view and only include depth and/or skeleton data. To the best of our knowledge, there is no existing dataset (bar one) that is suitable for view-invariant movement assessment from RGB images. The only known multi-view dataset is SMAD, used in Sardari et al. [7]. Although it provides RGB data recorded from 4 different views, it only includes annotated data for a walking action and the subjects’ movements are only broadly classified into normal/abnormal, without any scores. Thus it is not a dataset we could use for comparative performance analysis. Next, we first introduce our new RGB multi-view Quality of Movement Assessment for Rehabilitation dataset, QMAR. Then, we give the details of a recently released rehabilitation dataset KIMORE [30], a single-view dataset that includes RGB images and score annotations, making it suitable for single-view evaluation.

3.1. QMAR

QMAR was recorded using 6 Primesense cameras with 38 healthy subjects, 8 female and 30 male. Figure 2 shows the position of the 6 cameras - 3 different frontal views and 3 different side views. The subjects were trained by a physiotherapist to perform two different types of movements while simulating two ailments, resulting in four overall possibilities: a return walk to approximately the original position while simulating Parkinsons (W-P), and Stroke (W-S), and standing up and sitting down with Parkinson (SS-P) and Stroke (SS-S). The dataset includes RGB and depth (and skeleton) data, although in this work we only use RGB. As capturing depth data from the 6 Primesense cameras was not possible due to infrared interference, the depth and skeleton data were retained from only view 2 at and view 5 at .

Figure 2

Typical camera views in the QMAR dataset with each one placed at a different height.

The movements in QMAR were scored by the severity of the abnormality. The score ranges were 0 to 4 for W-P, 0 to 5 for W-S and SS-S, and 0 to 12 for SS-P. A score of 0 in all cases indicates a normally executed action. Sample frames from QMAR are shown in Figure 3. Table 1 details the quality score or range and the number of frames and sequences for each action type. Table 2 details the number of sequences for each score.

Figure 3

Sample frames from QMAR dataset, showing all 6 views for (top row) walking with Parkinsons (W-P), (second row) walking with Stroke (W-S), (third row) sit-stand with Parkinsons (SS-P), and (bottom row) sit-stand with Stroke.

Table 1

Details of the movements in the QMAR dataset.

Action	Type	Quality Score	# Sequences	# Frames/Video (Min–Max)	Total Frames
W	Normal	0	41	62–179	12,672
W-P	Abnormal	1–4	40	93–441	33,618
W-S	Abnormal	1–5	68	104–500	57,498
SS	Normal	0	42	28–132	9,250
SS-P	Abnormal	1–12	41	96–558	41,808
SS-S	Abnormal	1–5	74	51–580	47,954

Table 2

Details of abnormality score ranges in the QMAR dataset.

Action	#1	#2	#3	#4	#5	#6	#7	#8	#9	#10	#11	#12
W-P	4	8	16	12	-	-	-	-	-	-	-	-
W-S	10	14	19	15	10	-	-	-	-	-	-	-
SS-P	1	1	6	8	4	4	4	3	3	1	2	4
SS-S	3	19	19	13	20	-	-	-	-	-	-	-

3.2. KIMORE

This is the only RGB single-view rehabilitation movement dataset where the quality of movements has been annotated with quantitative scores. KIMORE [30] has 78 subjects (44 healthy, and 34 real patients suffering from Parkinsons, Stroke, and back pain) performing five types of rehabilitation exercises (Ex #1 to Ex #5) for lower-back pain. All videos are frontal view; see sample frames in Figure 4.

Figure 4

Sample frames of KIMORE for five different exercises.

KIMORE [30] provides two types of scores, and , with values in the range 0 to 50 for each exercise as defined by clinicians. and represent the motion of upper limbs and physical constraints during the exercise respectively.

4. Proposed Method

Although its appearance changes significantly when we observe an instance of human movement from different viewpoints, the 2D spatio-temporal trajectories generated by body joints in a sequence are affine transformations of each other. For example, see Figure 5, where the trajectory maps of just the feet joints appear different in orientation, spatial location and scale. Thus, our hypothesis is that by extracting body joint trajectory maps that are translation, rotation, and scale invariant, we should be able to assess the quality of movement from arbitrary viewpoints one may encounter in-the-wild.

Figure 5

Walking example—all six views, and corresponding trajectory maps for feet.

The proposed VI-Net network has a view-invariant trajectory descriptor module (VTDM) that feeds into a subsequent movement score module (MSM) as shown in Figure 1. In VTDM, first a 2D convolution filter is applied on stacked heatmaps of each body joint over the video clip frames to generate a trajectory descriptor map. Then, the Spatial Transformer Network (STN) [22] is applied to the trajectory descriptor to make it view-invariant. The spatio-temporal descriptors from all body joints are then stacked as input into the MSM module, which can be implemented by an adapted, pre-trained CNN to learn the relationship amongst the joint trajectories and provide a score for the overall quality of movement. We illustrate the flexibility of MSM by implementing two different pre-trained networks, VGG-19 and ResNeXt-50, and compare their results. VI-Net is trained in an end-to-end manner. As the quality of movement scores in our QMAR dataset are discrete, we use classification to obtain our predicted score. Table 3 carries further details of our proposed VI-Net.
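To illustrate the first VTDM step described above, the sketch below collapses the T stacked heatmaps of one joint into a single map, using one 3×3 filter with T input channels followed by ReLU. It is a minimal pure-Python sketch under stated assumptions, not the released implementation: batch normalisation is omitted, the filter weights are set by hand rather than learned, zero padding is assumed, and all function and variable names are hypothetical.

```python
def aggregate_heatmaps(heatmaps, kernel):
    """Collapse T stacked heatmaps of one joint into one trajectory map.

    heatmaps: list of T frames, each an H x W list of floats.
    kernel:   list of T slices, each 3 x 3 (one slice per frame/channel).
    Returns an H x W map; zero padding keeps the spatial size.
    """
    T = len(heatmaps)
    H, W = len(heatmaps[0]), len(heatmaps[0][0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            acc = 0.0
            for t in range(T):              # sum over time (input channels)
                for dy in (-1, 0, 1):       # 3 x 3 spatial neighbourhood
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < H and 0 <= xx < W:
                            acc += kernel[t][dy + 1][dx + 1] * heatmaps[t][yy][xx]
            out[y][x] = max(acc, 0.0)       # ReLU
    return out


def moving_dot_clip(T=4, H=5, W=6):
    """Toy clip: a single 'joint' moving one pixel to the right per frame."""
    clip = []
    for t in range(T):
        frame = [[0.0] * W for _ in range(H)]
        frame[2][t + 1] = 1.0
        clip.append(frame)
    return clip


clip = moving_dot_clip()
# Hand-set filter that simply sums the centre pixel over all frames.
sum_kernel = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]] for _ in clip]
trajectory = aggregate_heatmaps(clip, sum_kernel)
```

With this hand-set filter the output row traces the path of the moving dot, which is the intuition behind a trajectory descriptor map; a learned filter would weight frames and neighbouring pixels differently.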

Table 3

VI-Net’s modules. {C2(d×d,ch)}×n: n 2D convolution filters with size d×d and channel size ch; {MP(d×d)}: 2D max pooling with size d×d; {FC(N)}: FC layer with N outputs. T is the number of clip frames, J is the number of joints, and S is the maximum score for a movement type.

	VTDM	MSM (Adapted VGG-19 or ResNeXt-50)
VI-Net	1st layer:{C2(3×3,T)}×1, BN, ReLU	1st layer VGG-19: {C2(3×3,J)}×64, BN, ReLU
	Localisation Network:	1st layer ResNeXt-50:
	{C2(5×5,1)}×10,{MP(2×2)}, ReLU,	{C2(7×7,J)}×64, {MP(3×3)}, ReLU
	{C2(5×5,10)}×10,{MP(2×2)}, ReLU,	Middle layers: As in VGG-19/ResNeXt-50
	{FC(32)}, ReLU, {FC(4)}	Last layer: {FC(S+1)}

Generating a Joint Trajectory Descriptor— First, we extract human body joint heatmaps by estimating the probability of each body joint at each image pixel, per frame for a video clip with T frames, by applying OpenPose [21]. Even though it may seem that our claim to be an RGB-only method may be undermined by the use of a method which was built by using joint labelling, the fact remains that OpenPose is used in this work as an existing tool, with no further joint labelling or recourse to non-RGB data. Other methods, e.g., Reference [49], which estimate body joint heatmaps from RGB images can equally be used. To reduce computational complexity, we retain the first 15 joint heatmaps of the BODY-25 version of OpenPose. This is further motivated by the fact, highlighted in Reference [47], that the remaining joints only provide repetitive information. Then, for each body joint , we stack its heatmaps over the T-frame video clip to get the 3D heatmap of size which then becomes the input to our VTDM module. To obtain a body joint’s trajectory descriptor , the processing in VTDM starts with the application of a convolution filter on to aggregate its spatial poses over time, that is, where is of size . We experimented with both 2D and 3D convolutions, and found that a 2D convolution filter yields the best results. Forging a View-Invariant Trajectory Descriptor— In the next step of the VTDM module, we experimented with STN [22], DCN [50,51], and ETN [52] networks, and found STN [22] the best performing option to forge a view-invariant trajectory descriptor out of . STN can be applied to feature maps of a CNN’s layers to render the output translation, rotation, scale, and shear invariant. It is composed of three stages. At first, a CNN-regression network, referred to as the localisation network, is applied to our joint trajectory descriptor to estimate the parameters for a 2D affine transformation matrix, . 
Instead of the original CNN in Reference [22], which applied 32 convolution filters followed by two fully connected (FC) layers, we formulate our own localisation network made up of 10 convolution filters followed by two FC layers. The rationale for this is that our trajectory descriptor maps are not as complex as RGB images, and hence fewer filters are sufficient to extract their features. The details of our localisation network’s layers are provided in Table 3. Then, in the second stage, to estimate each pixel value of our view-invariant trajectory descriptor , a sampling kernel is applied on specific regions of , where the centres of these regions are defined on a sampling grid. This sampling grid is generated from a general grid and the predicted transformation parameters, such that where are the centers of the regions of the sampling kernel is applied to, in order to generate the new pixel values of the output feature map . Jaderberg et al. [22] recommend the use of different types of transformations to generate the sampling grid based on the problem domain. In VTDM, we use the 2D affine transformations shown in Equation (2). Finally, the sampler takes both and to generate a view-invariant trajectory descriptor from at the grid points by bilinear interpolation. Assessing the Quality of Human Movement— In the final part of VI-Net (see Figure 1), the collection of view-invariant trajectory descriptors for joints , are stacked into a global descriptor and passed through a pre-trained network in the MSM module to assess the quality of movement of the joints. VGG-19 and ResNeXt-50 were chosen for their state-of-the-art performances, popularity, and availability. For VGG-19, its first layer was replaced with a new 2D convolutional layer, including convolution filters with channel size J (instead of 3 used for RGB input images), and for ResNeXt-50 its first layer was replaced with convolution filters with channel size J. 
The last FC layer in each case was modified to allow movement quality scoring through classification, where each score is considered as a class. That is, for a movement with maximum score S, where S = 4 for W-P, S = 5 for W-S and SS-S, and S = 12 for SS-P movements, the last FC layer of VI-Net has S + 1 output units. Although VGG-19/ResNeXt-50 were trained on RGB images, we still benefit from their pretrained weights, since our new first layers were initialised with their original first layer weights. The output of this modified layer has the same size as the output of the layer it replaces (Table 3), so the new layer is compatible with the rest of the network. In addition, we normalize the pixel values of the trajectory heatmaps to be between 0 and 255, that is, the same range as the RGB images on which VGG and ResNeXt were trained. Trajectory descriptor maps also have shape and intensity variations, so the features extracted from them should be as valid as those from the natural images on which VGG and ResNeXt operate.
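The grid generation and bilinear sampling stages of the STN described above can be sketched as follows. This is an illustrative pure-Python sketch, not the authors' code. It assumes a full 2×3 affine matrix acting on coordinates normalised to [-1, 1], as in the general STN formulation; how the four outputs of the localisation network in Table 3 populate this matrix is not recoverable from the text here, so the matrices below are set by hand. All names are hypothetical.

```python
import math


def affine_grid(theta, H, W):
    """For each output pixel, compute where to sample in the source map.

    theta is a 2 x 3 affine matrix; coordinates are normalised to [-1, 1].
    """
    grid = []
    for i in range(H):
        y_t = -1.0 + 2.0 * i / (H - 1)
        row = []
        for j in range(W):
            x_t = -1.0 + 2.0 * j / (W - 1)
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


def bilinear_sample(src, grid):
    """Sample src at the grid points; locations outside src contribute zero."""
    H, W = len(src), len(src[0])
    out = []
    for row in grid:
        out_row = []
        for x_s, y_s in row:
            # Convert normalised coordinates back to pixel indices.
            x = (x_s + 1.0) * (W - 1) / 2.0
            y = (y_s + 1.0) * (H - 1) / 2.0
            x0, y0 = math.floor(x), math.floor(y)
            value = 0.0
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= yy < H and 0 <= xx < W:
                        weight = (1.0 - abs(x - xx)) * (1.0 - abs(y - yy))
                        value += weight * src[yy][xx]
            out_row.append(value)
        out.append(out_row)
    return out


# Toy 5 x 5 descriptor map with a single active pixel.
descriptor = [[0.0] * 5 for _ in range(5)]
descriptor[2][3] = 1.0

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
shift = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]  # 0.5 normalised units = 1 pixel here

same = bilinear_sample(descriptor, affine_grid(identity, 5, 5))
shifted = bilinear_sample(descriptor, affine_grid(shift, 5, 5))
```

The identity matrix reproduces the input, while the translation moves the active pixel one column to the left. In the network, the localisation network would predict the transformation per descriptor map so that trajectories from different views are brought into a common frame.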

5. Experiments and Results

We first report on two sets of experiments on QMAR to evaluate the performance of VI-Net to assess quality of movement, based around cross-subject and cross-view scenarios. Then, to show the efficiency of VI-Net on other datasets and movement types, we present its results also on the single-view KIMORE dataset. We used Pytorch on two GeForce GTX 750 GPUs. All networks were trained for 20 epochs using stochastic gradient descent optimization with initial learning rate of 0.001, and batch size 5. To evaluate the performance of the proposed method, we used Spearman’s rank correlation as used in References [1,3,4]. Dataset Imbalance— It can be seen from Table 1 and Table 2 that the number of sequences for score 0 (normal) is many more than the number of sequences for other individual scores, so we randomly selected 15 normal sequences for W-P, W-S, SS-S movements and 4 normal sequences for SS-P to mix with abnormal movements to perform all our experiments. To further address the imbalance, we applied offline temporal cropping to add new sequences. Network Training and Testing— For each movement type, the proposed network is trained from scratch. In both the training and testing phases, video sequences were divided into 16-frame clips (without overlaps). In training, the clips were selected randomly from amongst all video sequences of the training set, and passed to VI-Net. Then, the weights were updated following a cross-entropy loss, where is the dimensional output of the last fully connected layer and s is the video clip’s ground truth label/score. In testing, every 16-frame clip of a video sequence was passed to VI-Net. After averaging the outputs of the last fully connected layer across each class for all the clips, we then set the score for the whole video sequence as the maximum of the clip scores (see Figure 6), that is, where and M is the number of clips for a video.
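The test-time procedure described above (non-overlapping 16-frame clips, per-class averaging over clips, then taking the highest-scoring class) might be sketched as below. This is one reading of the text under stated assumptions: a trailing remainder shorter than 16 frames is dropped, and the video score is the index of the largest averaged output. The function names and toy outputs are hypothetical.

```python
def split_into_clips(frames, clip_len=16):
    """Split a sequence into non-overlapping clips of clip_len frames.

    Assumption: a trailing remainder shorter than clip_len is discarded.
    """
    last_start = len(frames) - clip_len
    return [frames[i:i + clip_len] for i in range(0, last_start + 1, clip_len)]


def video_score(clip_outputs):
    """Average the per-class outputs over all clips and return the best class.

    clip_outputs: list of M lists, each holding S + 1 values (one per score),
    standing in for the outputs of the last fully connected layer.
    """
    num_clips = len(clip_outputs)
    num_classes = len(clip_outputs[0])
    mean = [sum(clip[c] for clip in clip_outputs) / num_clips
            for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: mean[c])


frames = list(range(40))                 # stand-in for 40 video frames
clips = split_into_clips(frames)         # two full clips; 8 frames dropped

# Toy outputs for three clips of a movement with scores 0..2.
toy_outputs = [[0.1, 2.0, 0.3],
               [0.2, 0.5, 1.0],
               [0.0, 1.5, 0.2]]
predicted = video_score(toy_outputs)
```

Averaging before taking the maximum means that a single clip with a confident but atypical prediction (the second clip here) does not decide the score for the whole sequence.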

Figure 6

Scoring process for a full video sequence in testing phase.

Comparative Evaluation— As we are not aware of any other RGB-based view-invariant method to assess quality of movement, we are unable to compare VI-Net’s performance to other methods under a cross-view scenario. However, for cross-subject and single-view scenarios, we evaluate against (a) a C3D baseline (fashioned after Parmar and Morris [3]) by combining the outputs of the C3D network to score a sequence in the test phase in the same fashion as in VI-Net, and (b) the pre-trained, fine-tuned I3D [35]. We also provide an ablation study for all scenarios by removing STN from VI-Net to analyse the impact of this part of the proposed method.
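For reference, the Spearman rank correlation used for all comparisons in this section can be computed as the Pearson correlation of the ranks, with tied values sharing their average rank. The following is a small standard-library sketch of that textbook definition. The function names are hypothetical and the example scores are made up for illustration, not taken from QMAR.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0       # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(predicted, truth):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    rp, rt = average_ranks(predicted), average_ranks(truth)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / math.sqrt(var_p * var_t)


# Made-up predicted vs. ground-truth scores for four sequences.
rho = spearman([0, 1, 1, 3], [0, 2, 1, 3])
```

Because only the ordering matters, a method can obtain a high rank correlation even when its predicted scores are offset from the ground truth, provided it orders the sequences by severity correctly.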

5.1. Cross-Subject Quality of Movement Analysis

In this experiment, all available views were used in both training and testing, while the subjects performing the actions were distinct. We applied k-fold cross validation where k is the number of scores for each movement. Table 4 shows that VI-Net outperforms networks based on C3D (after Reference [3]) and I3D [35] for all types of movements, regardless of whether VGG-19 or ResNeXt-50 are used in the MSM module. While I3D results are mostly competitive, C3D performs less well due to its shallower nature, and larger number of parameters, exacerbated by QMAR’s relatively small size. We show in Section 5.3 that C3D performs significantly better on a larger dataset.

Table 4

Comparative cross-subject results on QMAR. The bold numbers show the best result for each action type.

Method			W-P	W-S	SS-P	SS-S	Avg
Custom-trained C3D (after Reference [3])			0.50	0.37	0.25	0.54	0.41
Pre-trained I3D			0.79	0.47	0.54	0.55	0.58
VI-Net	VTDM+MSM (VGG-19)	w/o STN	0.81	0.49	0.57	0.74	0.65
	VTDM+MSM (VGG-19)	w STN	0.82	0.52	0.55	0.73	0.65
	VTDM+MSM (ResNeXt-50)	w/o STN	0.87	0.56	0.48	0.72	0.65
	VTDM+MSM (ResNeXt-50)	w STN	0.87	0.52	0.58	0.69	0.66

As an ablation analysis, to test the effectiveness of STN, we present VI-Net’s results with and without STN in Table 4. It can be observed that the improvements with STN are not consistent across the actions: when all viewpoints are used in training, the MSM module gets trained on all trajectory orientations, such that the effect of STN is often overridden.

5.2. Cross-View Quality of Movement Analysis

We evaluate the generalization ability of VI-Net on unseen views by using cross-view scenarios, that is, distinct training and testing views of the scene, while data from all subjects is utilised. We also make sure that each test set contains a balanced variety of scores from low to high. Recent works such as References [25,26,27,28], provide cross-view results only when their network is trained on multiple views. As recently noted by Varol et al. [29], a highly challenging scenario in view-invariant action recognition would be to obtain cross-view results by training from only one viewpoint. Therefore, we performed the training and testing for each movement type such that (i) we trained from one view only and tested on all other views (as reasoned in Section 1), and in the next experiment, (ii) we trained on a combination of one frontal view (views 1 to 3) and one side view (views 4 to 6) and tested on all other available views. Since for the latter case there are many combinations, we show results for only selected views: view 2 with all side views, and view 5 with all frontal views. Since in cross-view analysis all subjects are used in both training and testing, applying the C3D and I3D models would be redundant because they would simply learn the appearance and shape features of our participants in our study and their results would be unreliable. In QMAR, when observing a movement from the frontal views, there is little or almost no occlusion of relevant body parts. However, when observing from side views, occlusions resulting from missing or noisy joint heatmaps from OpenPose, can occur for a few seconds or less (short-term), or for almost the whole sequence (long-term). Short term occlusions are more likely in walking movements W-P and W-S, while long-term occlusions occur more often in sit-to-stand movements (SS-P and SS-S). The results of our view-invariancy experiments, using single views only in training, are shown in Table 5. 
It can be seen that for walking movements W-P and W-S, VI-Net is able to assess the movements from unseen views well, with the best results reaching 0.73 and 0.66 rank correlation respectively (yellow highlights), and is only somewhat affected by short-term occlusions. However, for sit-to-stand movements SS-P and SS-S, the long-term occlusions during these movements affect the integrity of the trajectory descriptors and the performance of VI-Net is not as strong, with the best results reaching 0.52 and 0.56 respectively (orange highlights). Note that, for all action types, VI-Net performs best on average when it uses STN with the adapted ResNeXt-50.
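The two cross-view protocols described above can be enumerated explicitly. The sketch below lists the training and testing views for (i) single-view training and (ii) the selected two-view combinations (view 2 with each side view, and view 5 with each frontal view). It is a hypothetical helper for illustration, with views 1 to 3 taken as frontal and views 4 to 6 as side, as stated in the text.

```python
FRONTAL = [1, 2, 3]
SIDE = [4, 5, 6]
ALL_VIEWS = FRONTAL + SIDE


def single_view_splits():
    """Train on one view, test on the five remaining views."""
    return [([v], [u for u in ALL_VIEWS if u != v]) for v in ALL_VIEWS]


def two_view_splits():
    """Train on one frontal plus one side view, test on the other four.

    Only the selected combinations are listed: view 2 with every side view,
    and view 5 with every frontal view (the pair 2,5 is listed once).
    """
    pairs = [(2, s) for s in SIDE] + [(f, 5) for f in FRONTAL if f != 2]
    return [(list(p), [u for u in ALL_VIEWS if u not in p]) for p in pairs]


single = single_view_splits()
paired = two_view_splits()
```

In both protocols the test views are never seen during training, while data from all subjects is used on both sides of the split.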

Table 5

Cross-view results for all actions with single-view training. The bold numbers show the best result for each view of each action type; Yellow highlights: best results for W-P and W-S actions amongst all views, Orange highlights: best result for SS-P and SS-S actions amongst all views.

	View	VTDM+MSM(VGG-19)		VTDM+MSM(ResNeXt-50)			View	VTDM+MSM(VGG-19)		VTDM+MSM(ResNeXt-50)
	View	w/o STN	w STN	w/o STN	w STN		View	w/o STN	w STN	w/o STN	w STN
W-P	1	0.51	0.67	0.64	0.67	W-S	1	0.51	0.43	0.60	0.64
	2	0.69	0.66	0.58	0.72		2	0.47	0.54	0.55	0.62
	3	0.62	0.66	0.63	0.70		3	0.64	0.56	0.61	0.59
	4	0.67	0.64	0.72	0.72		4	0.60	0.59	0.60	0.66
	5	0.67	0.67	0.68	0.71		5	0.62	0.60	0.62	0.63
	6	0.69	0.72	0.69	0.73		6	0.46	0.40	0.53	0.60
	Avg	0.64	0.67	0.65	0.70		Avg	0.55	0.52	0.58	0.62
SS-P	1	0.30	0.32	0.25	0.25	SS-S	1	0.36	0.49	0.44	0.45
	2	0.27	0.31	0.31	0.32		2	0.47	0.40	0.56	0.56
	3	0.16	0.23	0.36	0.43		3	0.37	0.52	0.38	0.43
	4	0.10	0.34	0.44	0.49		4	0.38	0.34	0.41	0.54
	5	0.50	0.52	0.43	0.45		5	0.26	0.50	0.50	0.48
	6	0.41	0.24	0.48	0.44		6	0.21	0.28	0.13	0.16
	Avg	0.29	0.32	0.37	0.39		Avg	0.34	0.42	0.40	0.43
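Each entry in the table above is a rank correlation between predicted and ground-truth quality scores. As a minimal sketch of how such a value can be computed, the snippet below implements Spearman's rank correlation in plain Python. Treating the reported metric as Spearman's coefficient is an assumption of this example, and the function names are illustrative only.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(predicted, target):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rank_p, rank_t = average_ranks(predicted), average_ranks(target)
    n = len(rank_p)
    mean_p, mean_t = sum(rank_p) / n, sum(rank_t) / n
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(rank_p, rank_t))
    var_p = sum((a - mean_p) ** 2 for a in rank_p)
    var_t = sum((b - mean_t) ** 2 for b in rank_t)
    if var_p == 0 or var_t == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / sqrt(var_p * var_t)
```

Because only the ordering of the scores matters, a model that ranks all test sequences correctly scores 1.0 even if its absolute predictions are offset or rescaled.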

Table 6 shows the results for each movement type when one side view and one frontal view are combined for training. VI-Net’s performance improves compared to the single-view experiment above, with the best results reaching 0.92 and 0.83 for W-P and W-S movements (green highlights) and 0.61 and 0.67 for SS-P and SS-S movements (purple highlights), because the network is effectively trained with both short-term and long-term occluded trajectory descriptors. These results also show that, on average, VI-Net performs better with the adapted ResNeXt-50 for walking movements (W-P and W-S) and with the adapted VGG-19 for sit-to-stand movements (SS-P and SS-S). This is potentially because ResNeXt-50’s variety of filter sizes is better suited to the variation in 3D spatial changes of joint trajectories inherent in walking movements, whereas VGG-19’s filters can tune better to the more spatially restricted sit-to-stand movements. We also note that the fundamental purpose of STN in VI-Net is to make efficient cross-view performance possible when the network is trained from a single view only. It is therefore expected that STN’s effect diminishes as more views are used, since the MSM module gets trained on more trajectory orientations (which we verified experimentally by training with multiple views).

Table 6

Cross-view results for all actions with two-view training. The bold numbers show the best result for each combination of views of each action type; Green highlights: best results for W-P and W-S actions amongst all view combinations, Purple highlights: best results for SS-P and SS-S actions amongst all view combinations.

Action	Views	VTDM+MSM (VGG-19)		VTDM+MSM (ResNeXt-50)		Action	Views	VTDM+MSM (VGG-19)		VTDM+MSM (ResNeXt-50)	
		w/o STN	w STN	w/o STN	w STN			w/o STN	w STN	w/o STN	w STN
W-P	2,4	0.77	0.81	0.87	0.89	W-S	2,4	0.58	0.72	0.81	0.73
	2,5	0.72	0.75	0.90	0.92		2,5	0.74	0.74	0.80	0.81
	2,6	0.75	0.76	0.73	0.77		2,6	0.64	0.67	0.74	0.68
	1,5	0.70	0.76	0.80	0.75		1,5	0.70	0.68	0.83	0.81
	3,5	0.73	0.79	0.87	0.84		3,5	0.66	0.66	0.82	0.79
	Avg	0.73	0.77	0.83	0.83		Avg	0.66	0.69	0.80	0.76
SS-P	2,4	0.55	0.52	0.41	0.46	SS-S	2,4	0.57	0.64	0.54	0.64
	2,5	0.60	0.53	0.49	0.46		2,5	0.62	0.56	0.63	0.61
	2,6	0.48	0.35	0.36	0.42		2,6	0.50	0.62	0.48	0.46
	1,5	0.46	0.55	0.39	0.52		1,5	0.64	0.53	0.48	0.58
	3,5	0.61	0.40	0.43	0.47		3,5	0.62	0.60	0.63	0.67
	Avg	0.54	0.47	0.41	0.46		Avg	0.59	0.59	0.55	0.58
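The statement that STN matters less once two views are used for training can be checked roughly against the Avg rows of Tables 5 and 6. The snippet below is a crude sanity check under stated assumptions, not the authors' analysis: the numbers are copied from the Avg rows, and averaging the gain over all action types and backbones is a simplification chosen for this example.

```python
# Avg rows copied from Tables 5 and 6 as (w/o STN, w STN) pairs.
SINGLE_VIEW_AVG = {
    ("W-P", "VGG-19"): (0.64, 0.67), ("W-P", "ResNeXt-50"): (0.65, 0.70),
    ("W-S", "VGG-19"): (0.55, 0.52), ("W-S", "ResNeXt-50"): (0.58, 0.62),
    ("SS-P", "VGG-19"): (0.29, 0.32), ("SS-P", "ResNeXt-50"): (0.37, 0.39),
    ("SS-S", "VGG-19"): (0.34, 0.42), ("SS-S", "ResNeXt-50"): (0.40, 0.43),
}
TWO_VIEW_AVG = {
    ("W-P", "VGG-19"): (0.73, 0.77), ("W-P", "ResNeXt-50"): (0.83, 0.83),
    ("W-S", "VGG-19"): (0.66, 0.69), ("W-S", "ResNeXt-50"): (0.80, 0.76),
    ("SS-P", "VGG-19"): (0.54, 0.47), ("SS-P", "ResNeXt-50"): (0.41, 0.46),
    ("SS-S", "VGG-19"): (0.59, 0.59), ("SS-S", "ResNeXt-50"): (0.55, 0.58),
}


def mean_stn_gain(avg_rows):
    """Mean change in rank correlation when STN is added, over all rows."""
    gains = [with_stn - without for without, with_stn in avg_rows.values()]
    return sum(gains) / len(gains)


if __name__ == "__main__":
    print("single-view training:", round(mean_stn_gain(SINGLE_VIEW_AVG), 3))
    print("two-view training:   ", round(mean_stn_gain(TWO_VIEW_AVG), 3))
```

On these figures the mean gain from STN is about 0.031 with single-view training and about 0.005 with two-view training, which is consistent with the diminishing effect described in the text.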

5.3. Single-View Quality of Movement Analysis

Next, we provide the results of VI-Net on the single-view KIMORE dataset, to illustrate that it can be applied to such data too. KIMORE provides two types of scores (see Section 3.2) which correspond strongly to each other, such that if one is low for a subject, so is the other. Hence, we trained the network on a single, summed measure to predict a final score ranging between 0 and 100 for each action type. We use a subset of the subjects for training and retain the remaining subjects for testing, ensuring each set contains a balanced variety of scores from low to high. Table 7 shows the results of the C3D baseline (after Reference [3]), the pre-trained, fine-tuned I3D [35], and VI-Net on KIMORE. VI-Net outperforms the other methods for all movement types except Exercise #3. VI-Net with the adapted VGG-19 performs better than with ResNeXt-50 for all movement types. This may be because, similar to the sit-to-stand movements in QMAR, where VI-Net performs better with VGG-19, all movement types in KIMORE are performed at the same location and distance from the camera, and thus carry less variation in 3D trajectory space. In this sense, our results are consistent across both datasets.
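One simple way to build a single training target and a score-balanced subject split is sketched below. Everything here is hypothetical: the names `score_a` and `score_b` stand in for the two KIMORE score types, and the `test_every` parameter is a placeholder because the split proportions are not given in this section. It illustrates the idea of a balanced split and is not the procedure used by the authors.

```python
def summed_score(score_a, score_b):
    """Combine the two per-subject scores into one target by summing them."""
    return score_a + score_b


def balanced_split(scores, test_every=3):
    """Split subject ids so that both sets span low to high scores.

    scores: dict mapping subject id -> summed score.
    test_every: hypothetical parameter; every n-th subject in score order
        is held out for testing, the rest are used for training.
    """
    if test_every < 2:
        raise ValueError("test_every must be at least 2")
    # Sort by score, breaking ties by subject id for a deterministic result.
    ordered = sorted(scores, key=lambda sid: (scores[sid], sid))
    train, test = [], []
    for position, sid in enumerate(ordered):
        if position % test_every == test_every - 1:
            test.append(sid)
        else:
            train.append(sid)
    return train, test
```

Holding out every n-th subject in score order keeps low, medium, and high scores in both sets, which is the property the text asks of the split.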

Table 7

Comparative results on the single-view KIMORE dataset. The bold numbers show the best result for each action type.

Method			Ex #1	Ex #2	Ex #3	Ex #4	Ex #5	Average
Custom-trained C3D (after Reference [3])			0.66	0.64	0.63	0.59	0.60	0.62
Pre-trained I3D			0.45	0.56	0.57	0.64	0.58	0.56
VI-Net	VTDM+MSM (VGG-19)	w/o STN	0.63	0.50	0.55	0.80	0.76	0.64
	VTDM+MSM (VGG-19)	w STN	0.79	0.69	0.57	0.59	0.70	0.66
	VTDM+MSM (ResNeXt-50)	w/o STN	0.55	0.42	0.33	0.62	0.57	0.49
	VTDM+MSM (ResNeXt-50)	w STN	0.55	0.62	0.36	0.58	0.67	0.55

In addition, although all sequences in both the training and testing sets have been captured from the same view, VI-Net’s performance improves on average with STN. This can be attributed to STN improving the network’s generalization across subjects. Also, unlike in QMAR’s cross-subject results, where C3D performed poorly, the results for C3D on KIMORE are promising because KIMORE has more data to help the network train more efficiently.

6. Conclusions

View-invariant human movement analysis from RGB is a significant challenge in action analysis applications, such as sports, skill assessment, and healthcare monitoring. In this paper, we proposed a novel RGB-based view-invariant method to assess the quality of human movement, which can be trained from a relatively small dataset and without any knowledge of the viewpoints used for data capture. We also introduced QMAR, the only multi-view, non-skeleton, non-mocap, rehabilitation movement dataset, to evaluate the performance of the proposed method; it may also serve the community well for comparative analysis. We demonstrated that the proposed method is applicable to cross-subject, cross-view, and single-view movement analysis by achieving an average rank correlation of 0.66 in the cross-subject setting, 0.65 on unseen views when trained from only two views, and 0.66 in the single-view setting.

VI-Net’s performance drops in situations where long-term occlusions occur, since OpenPose fails in such cases to produce sufficiently consistent heatmaps. In general, however, many methods suffer from long-term occlusions, so such failure is expected. Another limitation of VI-Net is that it has to be trained separately for each movement type. For future work, we plan to apply 3D pose estimation methods to generate more robust joint heatmaps, which would also be less troubled by occlusions. We also plan to develop multitask learning so that the network can recognize the movement type and its score simultaneously. Moreover, we aim to improve the performance of our method on unseen views by unsupervised training of view-invariant features from existing multi-view datasets for transfer to our domain.

