Literature DB >> 34258462

Pedestrian attribute recognition using trainable Gabor wavelets.

Imran N Junejo¹, Naveed Ahmed², Mohammad Lataifeh².

Abstract

Surveillance cameras are everywhere keeping an eye on pedestrians or people as they navigate through the scene. Within this context, our paper addresses the problem of pedestrian attribute recognition (PAR). This problem entails the extraction of different attributes such as age-group, clothing style, accessories, footwear style etc. This is a multi-label problem with a host of challenges even for human observers. As such, the topic has rightly attracted attention recently. In this work, we integrate trainable Gabor wavelet (TGW) layers inside a convolution neural network (CNN). Whereas other researchers have used fixed Gabor filters with the CNN, the proposed layers are learnable and adapt to the dataset for a better recognition. We test our method on publicly available challenging datasets and demonstrate considerable improvements over state of the art approaches. Crown

Entities: Chemical Disease Gene Species

Keywords: Attribute recognition; Computer vision; Deep learning

Year: 2021 PMID： 34258462 PMCID： PMC8258859 DOI： 10.1016/j.heliyon.2021.e07422

Source DB: PubMed Journal: Heliyon ISSN： 2405-8440

Introduction

Being one of the active areas of research in computer vision, the pedestrian attribute recognition (PRA) deals with identifying several visual attributes from an image data. The identified attributes belong to different classes, e.g., clothing style, footwear, gender, age group etc. A successful outcome of this research can be applied to various domains. It can be employed for motion analysis [1], [2], where it can be used to identify crowd behavior attributes. Another important area of application is image-based surveillance or visual features extractions for person identification and tracking [3], [4], [5], all of which can lead to further applications such as video analytic for business intelligence, and person re-identification based on the extracted features [6]. Various factors add to the complexity of this challenge. One of the main factors is the changing lighting conditions. Attributes of the same type of clothing or objects can appear completely different under various lighting conditions. For example, distinguishing between black and dark blue colors is very difficult in certain weather conditions. Both colors will appear very similar to the camera in a darker environment. Occlusion also complicates the correct visual attribution identification and recognition [7]. Complete or partial occlusions occur due to the camera orientation, or from object self-occlusions. For example, if a person is wearing a hat, it might appear partially in the image, or its shape might be completely different. Similarly, the orientation of a person or a camera can hide a backpack partially or completely from the view. These examples clearly show that settings of an acquisition environment for image or video capture result in a high intra-class variation for the same visual attributes. The focus of this work is the identification of visual attributes from image and video data. The distance of an object from the camera affects how that object appears in an image. If the object is very far from the camera or if the image resolution is very low, a visual attribute for a dress, hat, backpack, scarf, shoes etc. will only occupy a few pixels in the image. The combination of low image resolution, in addition to the self or view-oriented occlusions, makes visual attribute identification a very challenging problem. Many of these issues can be seen in the most widely used pedestrian datasets. Fig. 1 shows some of the samples from the PEdesTrian Attribute (PETA) [8] and A Richly Annotated Pedestrian (RAP) [9] datasets. PETA is one of the largest benchmark datasets. It comprises of 19000 images of different resolution that cover more than 60 attributes. The dataset is acquired from real-world surveillance camera systems and includes images of persons. It is a very challenging dataset because of the acquisition setup and scene settings. As can be seen in Fig. 1, the quality of images is very low as well. This is mainly due camera resolution, acquisition conditions producing significant blur, and few occlusions that cause many attributes to remain hidden. RAP dataset comprises of 41 thousand images covering 72 attributes and is acquired from multiple viewpoints. The dataset shows a huge variation in the attributes due to pedestrian appearances, viewpoints, and severe occlusions. An analysis of these datasets reveals that visual attributes identification from these images is an extremely difficult task due to the very low image quality. Many of the attributes are largely occluded as well. Moreover, some of the objects appear quite blurred due to the fast motion or acquisition problems - adding more complexity to the problem.

Figure 1

(a) PETA [8] dataset Samples. (b) RAP [9] dataset samples.

(a) PETA [8] dataset Samples. (b) RAP [9] dataset samples. Visual attribute recognition problem can be solved in different ways, but the predominant solutions involve a two-step process. In the first step, a feature extraction algorithm is employed to find a feature representation of the attributes. Different feature extraction solutions are discussed in the computer vision literature. Most of these techniques require deep domain-knowledge and high-level expertise in fine tuning for an accurate representation of visual attributes. For feature representation, methods like SIFT [10], HoG [11] or Haar-like features [12] have been employed in the field rigorously. Feature extraction is followed by the attribute's classification step, for which, Support Vector Machines (SVM) [8] has been the most widely used technique in the last decade. In recent years, the convolutional neural networks (CNNs) have almost completely replaced SVMs for classification tasks. Compared to earlier attribute learning or image classification methods, CNNs are more effective and robust. In this work, we make use of the Gabor wavelets, which have been used in the computer vision literature extensively over the last many decades. However, there have been only few works that use the Gabor wavelets in conjunction with the CNNs. For the majority of the works that do employ these wavelets, filters are pre-constructed and fed as filters to the convolutional network. However, we adopt an approach where the convolutional network is employed to learn the wavelet parameters along with learning other neural network parameters. These Trainable Gabor wavelets (TGW) [13] layers make up for the backbone of our network. Each TGW layer accepts a single channel input, with a multi-channel output, and learns the best parameters to generate an adaptable set of Gabor filters. TGW layer contains a convolution layer that uses the steerability of Gabor wavelets to address orientation issues. We also use a regular convolutional layer to extract features from the input as well. These outputs from TGW and convolution layers are stacked together, which are referred to as mixed-layer, and make up the building block of our network. The proposed network, shown in Fig. 3, undergoes a series of fully connected (fc) layers that are connected to the final network output layer. The proposed network is simple and trainable with a standard gradient-decent method.

Figure 3

Our main contributions are: We for the first time make use of the trainable Gabor wavelets to the problem of pedestrian attribute recognition. We propose a novel network that, while learning the Gabor wavelet parameters, combines the learned wavelet features with the regular convolution layers. The proposed method is demonstrated to have better recognition results than state of the art on two of the most challenging public datasets.

Related work

In this section we discuss works that are in spirit similar to our method, a detailed survey can be found in Wang et al. recent survey [14]. PETA [8] is one of the most widely used pedestrian datasets. While introducing the dataset, the Deng et al. [8] use the luminance channel and apply Ensemble of Localized Features (ELF) and Gabor and Schmid filter on it. To address the class imbalance problem, they also apply ikSVMs [15] separately on each attribute. They exploit similarity between images using the Markov Random Field (MRF). In their representation, each image is a node and the link between two nodes is determined by the similarity between neighboring images. RAP dataset [9] is acquired from multiple viewpoints that introduces significant variations for the same attributes along with severe occlusions. They employed two CNN models based on Caffe framework [16] to analyze the impact of the variations introduced by different viewpoints and occlusions on the overall classification of the attributes. They trained SVMs in addition to the adopting ELF. Additionally, they divide the image into multiple blocks (three in their case) to employ a part-based classification scheme. For their work, the parts comprised of: upper body (torso), lower body, and head and shoulders. Joo et al. [17] proposed another approach that also employed part-based recognition. In their work, they first created Histogram of Oriented Gradient (HoG) features from an image subdivided into multiple overlapping regions. For attributes classification, they employed a Poselet-based approach [18]. Furthermore, Zhao et al. [19] proposed a solution based on a Recurrent Neural Network (RNN). In their work, they employed two end-to-end models: Recurrent Attention (RA) and Recurrent Convolutional (RC). The correlations between various attribute groups are mined by the RC model, while the intra-group attention, correlation, and spatial locality are used by the RA model to improve the performance and robustness of pedestrian attribute recognition. Nonetheless, their network has a very deep architecture, hence the number of parameters is quite large. In another part-based approach, Zhu et al. [20] proposed a CNN-based solution where the human body is divided into 15 parts, and a CNN is trained separately for each part. The contribution of each attribute determines the weight of the corresponding CNN. Zhou et al. [21] use GoogLeNet for the initial mid-level feature extraction from detection layers. The activation maps from these detection layers are clustered and fused to localize the pedestrian attributes. Only image labels are used to train the detected layers in order to learn the relationship between the mid-level features and the pedestrian attributes. Max-pooling is used in a weakly-supervised technique for object detection training. Similarly, Chen et al. [22] suggested a part-based network that combined LOMO features [23] with CNN extracted features. They showed that the Scale-Invariant Local Ternary Patterns and HSV histograms based LOMO features are illumination-invariant texture and color descriptors. Furthermore, a pose-guided model was also presented [24] based on pedestrian body structure knowledge. In the first step, the model computes transformation parameters to estimate the pose from an image. Based on the pose information, human body parts are localized, and the final attribute recognition is estimated by fusing multiple features. Another parts-localization method was offered by Liu et al. [25]. They proposed a Localization Guide Network (LGNet) that uses a CNN model based on Inception-v2 [26] for feature extraction. Afterwards, a global average pooling layer (GAP) is adopted to extract global features. The global and local features are fused to perform the pedestrian attributes classification. A visual semantic graph approach has also been presented [27], using ResNet-50 to for the pedestrian images feature extraction. Yet, having more than fifty layers, the proposed network contained a large number of parameters. Furthermore, a multi-branch approach has also been proposed using multi-colorspace input [28]. Sarfraz et al. [29] proposed an end-to-end CNN-based network (VeSPA). This network consists of four parts, where each part corresponds to a specific pose category. Pose-specific attributes of each category are learned by each of these network parts. Their work demonstrated that coarse body pose information greatly influences the pedestrian attribute recognition. This work was extended in [30] adding a ternary view classifier in a modified approach as feature maps were obtained using a global weighting solution prior to the final embedding. P-Net [31] employs a part-based approach using GoogLeNet. The location attributes for different body parts are estimated using refined convolutional feature maps. A joint person re-identification and attribute recognition approach (HydraPlus-Net) is presented by Liu et al. [32]. HydraPlus-Net is an Inception-based network and aggregates feature layers from multi-directional attention modules for the final feature representation. Sarafianos et al. [33] present a multi-branch network that addresses class imbalance problem by employing a trivial weighting scheme. The network is guided towards crucial body parts using the extracted visual attention masks. These visual attention masks are used to obtain an improved feature representation by fusing them at varying scales. Another end-to-end method for person attribute recognition that uses Class Activation Map (CAM) network [34] to refine attention, heat map is proposed by Guo et al. [35] where different image attributes are identified using CAM network to refine the attention heat map for an improved recognition. A Harmonious Attention CNN (HA-CNN) based joint learning approach for person re-identification is presented in [36]. Hard regional and soft pixel attention are learned in a combined manner using HA-CNN. Feature representation is obtained by this simultaneous optimization. A Multi-Level Factorization Net (MLFN) that identifies latent discriminative factors from visual appearance of a person is proposed by [37]. The multi-semantic levels factorization is done without manual annotation. A Transferable Joint Attribute-Identity Deep Learning (TJ-AIDL) model that allows for a simultaneous learning of an identity discrimination and attribute-semantic feature representation is proposed by [38]. Furthermore, Si et al. [39] proposed a Dual Attention Matching network (DuATM), which is a joint learning end-to-end person re-identification framework. Their method simultaneously performs context-aware feature sequences learning, and attentive sequence comparison in a joint learning mechanism for person re-identification. A Generative Adversarial Network based pose-normalized person re-identification framework is presented in [40]. They learn pose invariant deep person re-identification features using synthesized images. A deep CNN based method to learn partial descriptive features for efficient person feature representation is presented in [41]. They employed a pyramid spatial pooling module and reported an improvement of 2.71% on the PETA dataset over [29]. [42] improved over [29] by employing a deeper network based on a context sensitive framework. Their proposed network creates a richer feature set using deeper residual (ResNet) networks that improved generalization and classification accuracy. The resulting model achieved the best in class results on attribute recognition datasets. Additionally, [43] presented a visual semantic graph reasoning framework that modeled spatial and attribute relationships using two types of graphs. A Graph Convolutional Network that combines potential semantic relationships of the attributes, and spatial relationship between local regions to be used for the training. A dual model approach was also presented for pedestrian recognition [44] using Recurrent Attention (RA) and Recurrent Convolutional (RC). The RC model employed a Convolutional-LSTM model to establish the correlations between different groups of attributes. To improve the overall robustness, the RA model uses both local attention correlation and global spatial locality. Using Gabor wavelets with CNNs have received a tremendous attention as well [13], [45], [46], [47], [48]. Gabor filter bank was proposed as the first layer of a CNN [45], the bank gets updated using standard back-propagation network learning phase. Similarly, Gabor filters were used in the first layer [46], however, while introducing lateral inhibition to enhance network performance, a n-fold cross validation was used to search for the best parameters. Within this approach, a combination of HOG and Gabor filters were used for feature extraction while CNNs were deployed for detection in [47]. Authors in [48] introduce a Gabor Neural Network (GNN) where Gabor filters are incorporated into the convolution filter as a modulation process, in spirit similar to the above mentioned works. In contrast to the above works where fixed Gabor filters are used, [13] introduce a trainable Gabor wavelet (TGW) layer. The authors present a method where the hyperparameters of the wavelets are learned from the input and a novel convolution layers are employed to create steerable filters. In this paper, we propose using this TGW layer with our proposed CNN for a novel solution to the problem of PAR. Our method is tested on two of the most challenging datasets and shows a considerable improvement over state of the art approaches.

Main approach

In this section, we start with the description of the Gabor wavelet layer. Followed by the proposed architecture of the network.

Gabor wavelet layer

We make use of the Trainable Gabor wavelets (TGW) layer as proposed by Kwon et al. [13] (see. Fig. 2). A neural network is used to generate the hyperparameters for the Gabor wavelet, and the generated Gabor filters are applied to filter inputs. In order to capture essential input features, a convolution layer is added to the TGW layer to capture features at different orientations.

Figure 2

Trainable Gabor Wavelet (TGW) layer [13]: Inputs and outputs are multichannel. A neural network is used to generate Gabor wavelet hyperparameters. These generated Gabor filters are then applied to the input. 1 × 1 convolution layer is added to enable the steerability of the Gabor wavelets.

Hyperparameter estimation

The 2D Gabor wavelet can be described as: where γ represents aspect ratio, λ represents wavelength of the sinusoidal, σ represents width or the standard deviation, , , and θ is an angle in the range . Thus, in order to specify a continuous Gabor wavelet, we need to determine the set of hyperparameters . In order to convert the continuous filter to a discrete one, sampling grids need to be defined, which is largely linked to σ. A new parameter is thus introduced to compute the discrete filter: where m and n are in the interval , and by just varying , variety of sampling grids can be achieved [13]. For a loss function L, we need to compute in order to train for the wavelet layer that is cascaded with our CNN. In order to train for the ζ, what remains is to compute , as is handled automatically by the deep learning libraries: as . The remaining parameters , , can be computed in a similar way and a similar parameterization can be adopted for the parameters and λ. A very significant parameter for the Gabor wavelet is the orientation (θ). These values are mostly chosen empirically. This parameter is also made trainable to better design orientations for the task at hand. To use the steering property, where a linear combination of finite set of responses can be used to represent convolution at any orientation, a convolution layer, working as a linear combination layer, is added to the output of the generated filters. For this layer, ten equally spaced fixed orientations are selected, working as basis filters: , and [13].

Attribute recognition network

The above mentioned TGW layer can be thought of as a feature extracting layer. In addition to this, we also employ it as the key building block of our network. Thus, in addition to functioning as the lowest layer, it also aids the network to learn high level features. The proposed network is shown in Fig. 3. An input image is first converted to a grayscale and then passes through a series of mixed-layers: combination of TGW layer and a convolution layer. The input to the TGW layer starts with a 1-channel conversion, i.e., a multi-channel input is converted to a 1-channel, which is a summation over the channel's operation for all layers except the first layer where we perform a simple color-to-gray image conversion. The parameters for these layers are given in Table 1.

Table 1

Parameters used for the TGW layers.

Layer	γ_o	λ_o	σ_o	ζ_o	TGW Channels	Conv Channels
1	0.3	6.8	5.4	6	128	128
2	0.3	5.6	4.5	5	128	128
3	0.3	4.6	3.6	4	128	128
4	0.3	3.5	2.8	3	128	128
5	0.3	2.5	2.0	2	128	128
6	0.3	2.5	2.0	2	128	128

Our Approach: The input images go through a series of 6 mixed-layers. The output of layer six is followed by three fc layers. Size of the last layer of the network matches the number of dataset attributes. Parameters of the network are mentioned in Table 1. Parameters used for the TGW layers. Each mixed-layer (1 to 6) contains 128 channels from the TGW layer and 128 channels from a convolution layer (denoted as 3Conv). Thus, depth of each mixed-layer block output is 256 (concatenation of TGW and 3Conv layer). For each 3Conv layer, as the name suggest, the kernel size is . The convolution is followed by LeakyReLU activation function, max-pool layer (size ), and a Batch Normalization (BN) layer. The size of an input image to each of these stacked layers is, respectively: , , , , , and . The mixed-layers are followed by three fully connected layers, i.e., fc1, fc2 and output, of size 512, 512 and 35 for PETA or 51 for RAP, respectively. Each fc layer uses LeakyReLU(0.01) as the activation function, followed by a dropout layer (), to minimize the number of parameters in the network. The final layer size matches the number of attributes of the dataset. The method proposes using Gabor wavelets merged with a deep neural network. Whereas other methods construct Gabor filters manually, proposed network learns the wavelet parameters, suitable to the dataset. Generated Gabor filters are stacked with convolution layers to build the overall network. As we shall show next, the proposed network is efficient and learns the structure of the dataset well to perform at par with state of the art.

Evaluation

As mentioned above, following the channel conversion, the grayscale image is processed through mixed-layers. Each mixed-layer consists of equal number of channels from TGW and 3Conv layer. Depth of each mixed layer output is 256. The mixed-layers are followed by a series of fully connected layers before the final output layer. LeakyReLU(0.01) is used as the activation function for all the layers. The output layer uses sigmoid as the activation function. To evaluate our method quantitatively, we compute various measures and report the results below. Although mean accuracy has been widely used in the attribute recognition literature, it treats each attribute independently of the other attributes. This might not necessarily be the case and an inter-attribute correlation might exist. Therefore, researchers also report example-based evaluations, namely accuracy (Acc), precision (), recall (Rec), and F1 score (F1) [9].

Dataset

RAP and PETA are the most widely used datasets for the problem of pattern attribute recognition. Collected from real-time surveillance cameras, the PETA dataset contains images collected from 10 publicly available datasets. The resolution of the images ranges from to . Collected from a multi-camera setup of around 26 cameras, the RAP dataset contains pedestrian samples. Each attribute is annotated independently, and the size of the images range from to . Most of the previous works [24], [29] report results on the PETA dataset using only 35 attributes. Similarly, for the RAP dataset, results are reported on 51 datasets. In order to make a fair comparison, we adopt the same scheme and test/train on the same attributes. Similarly, for a fair comparison, experiments are conducted on 5 random splits: we allocate samples for training, samples for validation, samples for testing on the PETA dataset. For the RAP dataset, we split it randomly into training images and test images [29]. We adopted the weighted-cross entropy loss function [24] in order to mitigate the class imbalance problem. Similarly, following other researchers, images are resized to an image resolution of . Pre-processing: we start with what is known as the mean subtraction where mean is computed for all images (for each of the color channel) and subtracted from the image data. Similarly, we compute the standard deviations, the normalization step, for images (and their color channels) and divide image values by this statistic. These steps are crucial and are equivalent to centering the data around its origin.

Setup

For deep learning, we adopted the KERAS [49] library, which is based on the TensorFlow backend. All experiments were performed on a cluster node with 2 x Intel Xeon E5 CPU, 128 GB Registered ECC DDR4 RAM, 32TB SAS Hard drive storage, and 8 x NVIDIA Tesla K80 GPUs.

Implementation details

We train the network for 50 epochs. LeakyReLU was used as the activation function for all layers of the network with the parameter 0.01. We used the Adam for update optimizer using the parameters: learning rate = , and . We added the dropout layers to the fc layers to prevent model over-fitting. We adopt weight decay by a factor of 0.1 after 15 epochs. The batch size was set to be 8. All weights in the network are initialized using He Normal initialization. For the TGW layers with a steering block, we use the scheme suggested by [50]: we fix the parameters as shown in Table 1 while training for ζ. This setup yields the best results in our experiments.

Results

We evaluate the effectiveness of the proposed method on both PETA and RAP datasets. Table 2 shows a comparison of the proposed method with six current state of the art methods. For the PETA dataset, Acc obtained from our method is 80.04%. This is higher than all the other methods that we compare with. The obtained results for the other measures (Pre, Rec and F1) is 86.49%, 80.1%, and 82.32% respectively. Class-wise accuracy chart for the PETA dataset is shown in Fig. 4. Interestingly, the lowest accuracy is that for the class upperBodyOther. Considering the image resolutions in the dataset, this is indeed a very difficult class to accurately measure. On the other hand, the highest accuracy is that of the classes upperBodyThinStripes and upperBodyVNeck.

Table 2

	PETA [8]				RAP [9]
	Acc	Prec	Rec	F1	Acc	Prec	Rec	F1
Chen et al. [24]	75.07	83.68	83.14	83.41	62.02	74.92	76.21	75.56
Li et al. [9]	−	−	−	−	63.67	76.53	77.47	77.00
Sudowe et al. [51]	73.66	84.06	81.26	82.64	62.61	80.12	72.26	75.98
Liu et al. [21]	74.62	82.66	85.16	83.40	53.30	60.82	78.80	68.65
Sarfaraz et al. [29]	77.73	86.18	84.81	85.49	67.35	79.51	79.67	79.59
Li et al. [32]	76.13	84.92	83.24	84.07	65.39	77.33	78.79	78.05

ours	80.04	86.49	80.1	82.32	91.1	92.39	91.1	91.56

Figure 4

Class-wise Accuracy - PETA dataset: the figure shows the obtained class-wise accuracy. The highest accuracy is for the class upperBodyThinStripes,upperBodyVNeck. The lowest accuracy is 66.0% for the class upperBodyOther.

Quantitative results (%) on PETA and RAP datasets. Results are compared with the other benchmark methods. As can be seen, we have comparable results, with considerable improved accuracy for both the datasets. Class-wise Accuracy - PETA dataset: the figure shows the obtained class-wise accuracy. The highest accuracy is for the class upperBodyThinStripes,upperBodyVNeck. The lowest accuracy is 66.0% for the class upperBodyOther. For the RAP dataset, similar to the PETA dataset, the obtained results are exceedingly encouraging. The obtained accuracy is 91.1%, while we obtained 92.39%, 91.1%, and 91.56% for the remaining measure precision, recall, and F1-score. The obtained results are a considerable improvement over state of the art. One significant reason for this difference is primarily the large size of the RAP dataset. For the RAP dataset, class-wise accuracy is shown in the Fig. 5. The class BaldHead is recognized with a highest accuracy score while the two class that had a low score were that of Age17-30, Age31-45. These two classes, naturally, are very difficult to judge, even for experience human observers. Other low performing classes are: Jacket, OtherAttachments.

Figure 5

Class-wise Accuracy - RAP dataset: The lowest accuracy is that of the classes: Age17-30, Age31-45. The highest accuracy is for the class BaldHead.

Class-wise Accuracy - RAP dataset: The lowest accuracy is that of the classes: Age17-30, Age31-45. The highest accuracy is for the class BaldHead. The proposed method makes a novel use of the Gabor wavelet layers. Instead of manually constructing Gabor filters, the layers are trainable and are able to correctly estimate wavelet parameters. The method converts the input image into grayscale and then passes it through a series of six mixed-layers blocks that learn the best parameters for the generated Gabor filters. These mixed-layers are a combination of TGW and 3Conv layers. Output from the last mixed-layer passes through three fc layers. We have obtained very encouraging results for the key measures. The method is novel and unique in the sense that it does not resort to data augmentation or part-based computations, as employed by [9]. We eliminate the need to compute pose estimation [24], or construct any hand-crafted features [22]. The discussed results demonstrated superiority over state of the art and justifies the novel use of Gabor wavelet layers.

Conclusion

In this paper, we present a novel application of trainable Gabor wavelets to the problem of pedestrian attribute recognition. In contrast to creating offline Gabor filters for image feature extraction, the proposed network learns Gabor wavelets parameters from the data in our deep learning architecture. The network is simple, and has been tested extensively on two of the most challenging publicly available datasets. The results are encouraging and surpass state of the art over many key measures. For the future work, we intend to use these trainable Gabor wavelets with other emerging deep network architectures.

Declarations

Author contribution statement

I.N. Junejo: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. N. Ahmed, M. Lataifeh: Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Funding statement

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data availability statement

Data included in article/supplementary material/referenced in article.

Declaration of interests statement

The authors declare no conflict of interest.

Additional information

No additional information is available for this paper.

3 in total

1. Modelling pedestrian travel time and the design of facilities: a queuing approach.

Authors: Khalidur Rahman; Noraida Abdul Ghani; Anton Abdulbasah Kamil; Adli Mustafa; Md Ahmed Kabir Chowdhury
Journal: PLoS One Date: 2013-05-15 Impact factor: 3.240

2. A bio-inspired, motion-based analysis of crowd behavior attributes relevance to motion transparency, velocity gradients, and motion patterns.

Authors: Florian Raudies; Heiko Neumann
Journal: PLoS One Date: 2012-12-31 Impact factor: 3.240

3. A multi-branch separable convolution neural network for pedestrian attribute recognition.

Authors: Imran N Junejo; Naveed Ahmed
Journal: Heliyon Date: 2020-03-17

3 in total

Layer	γ_o	λ_o	σ_o	ζ_o	TGW Channels	Conv Channels
1	0.3	6.8	5.4	6	128	128
2	0.3	5.6	4.5	5	128	128
3	0.3	4.6	3.6	4	128	128
4	0.3	3.5	2.8	3	128	128
5	0.3	2.5	2.0	2	128	128
6	0.3	2.5	2.0	2	128	128

Layer	γ_o	λ_o	σ_o	ζ_o	TGW Channels	Conv Channels
1	0.3	6.8	5.4	6	128	128
2	0.3	5.6	4.5	5	128	128
3	0.3	4.6	3.6	4	128	128
4	0.3	3.5	2.8	3	128	128
5	0.3	2.5	2.0	2	128	128
6	0.3	2.5	2.0	2	128	128

Layer	γ_o	λ_o	σ_o	ζ_o	TGW Channels	Conv Channels
1	0.3	6.8	5.4	6	128	128
2	0.3	5.6	4.5	5	128	128
3	0.3	4.6	3.6	4	128	128
4	0.3	3.5	2.8	3	128	128
5	0.3	2.5	2.0	2	128	128
6	0.3	2.5	2.0	2	128	128