Multi-stream part-fused graph convolutional networks for skeleton-based gait recognition

Gait recognition, the task of identifying people by their walking pattern, has attracted increasing attention from researchers. At present, most skeleton-based gait recognition approaches extract gait features from joint coordinates alone. However, other information, e.g. bone and motion, is equally instructive and discriminative for gait recognition. Thus, this paper proposes a novel multi-stream part-fused graph convolutional network, MS-Gait, to fuse part-level information and capture multi-order features from skeleton data. To be specific, we integrate a channel attention learning mechanism into the graph convolutional networks (GCN) to improve the representational power. In addition, part-level information is merged by capturing features from the skeleton graph and its subgraphs concurrently. Finally, a multi-stream strategy is proposed to model joint, bone, and motion dynamics simultaneously, which is shown to effectively improve the recognition accuracy. Extensive experiments on the popular CASIA-B dataset demonstrate that our method achieves state-of-the-art performance and is robust to confounding variations.


Introduction
In recent years, biometric recognition technology (Chen et al., 2021; Gupta & Sehgal, 2020; F. Liu et al., 2020) has made considerable progress and has been widely applied in the field of social security. As an emerging biometric technology, gait recognition, which uses gait characteristics to authenticate people's identities, has been widely studied because it supports long-distance recognition and requires no subject cooperation. Studies have shown that gait is a complex behavioural characteristic affected by body shape, muscle strength, physical coordination, and other factors, which makes it difficult for a person to imitate or disguise the way they walk. Therefore, owing to this high discriminative power, recognition technology based on gait features has substantial industrial significance and considerable application prospects in fields such as video surveillance, criminal investigation, smart homes and medical research.
In earlier studies on gait recognition, appearance-based methods (Han & Bhanu, 2005; Y. Zhang et al., 2019), which employ silhouette images extracted from videos through background subtraction, have been more prevalent over the past two decades than model-based methods. However, since the shape of silhouettes is highly sensitive to changes in external conditions, such as clothing, carrying and viewing angle, the performance of such methods may decline sharply when the covariate conditions change. On the contrary, model-based approaches are more invariant to appearance variations since they extract gait features based on the human body structure and its movements. Specifically, some earlier works (Nixon et al., 1996; Tanawongsuwan & Bobick, 2001; Wang et al., 2004) tried to model the human body manually and to exploit the discrepancy between motion patterns of different body parts for identification. This is reasonable in principle but hard to realise in practice, since it is very challenging to locate and track body parts accurately. Recently, with the progress of human pose estimation (Z. Cao et al., 2019), model-based gait recognition approaches are gradually revealing their great potential and broad prospects.
Certain skeleton-based methods (Liao et al., 2017, 2020) model the human body as a sequence of coordinate vectors or a 2D grid, which is then fed into LSTMs or convolutional neural networks (CNNs) to generate the predicted labels. Actually, the skeleton is naturally structured as a graph in a non-Euclidean space, where nodes represent the joints of the human body and edges represent the physical dependencies between them. Obviously, the methods above cannot utilise such a graph structure of the human skeleton or express physical dependencies properly. Lately, graph convolutional networks, a kind of graph-based neural network, have achieved remarkable performance in many applications (Bruna et al., 2013; Defferrard et al., 2016; Kipf & Welling, 2016; Yan et al., 2018). For the skeleton-based gait recognition task, Li et al. (2020) first introduced GCNs to capture gait features from 2D joints. They construct a spatial-temporal graph by connecting joints with physical dependencies within a frame and the same joints between consecutive frames, and then apply GCN and TCN (temporal convolution) layers to learn spatial and temporal features alternately.
However, Li et al. (2020) take the coordinate vectors of all 18 joints estimated from videos as the attribute of each frame. Since gait videos are generally captured from a long distance in real scenes, not all joints, such as the eyes and ears, can be accurately located and tracked. The deviations of these joints not only fail to improve the performance of gait recognition, but may even degrade it. Apart from that, current GCN-based methods (Shopon et al., 2021; Teepe et al., 2021) simply treat the entire skeleton graph as the input of the network, neglecting the fact that the human skeleton is a combination of multiple body parts. However, a recent study (Fan et al., 2020) has shown that part-level features, which represent the importance of body parts and the relations between them, are also critical for gait recognition. Moreover, existing model-based gait recognition approaches (Li et al., 2020; Liao et al., 2017, 2020; Shopon et al., 2021) only exploit the 2D or 3D coordinates of human joints. The bone and motion information they neglect is also instructive and discriminative for gait recognition: the former represents the length and orientation between joints, and the latter the movement and speed of joints across frames. Motivated by these issues, we propose a novel multi-stream part-fused graph convolutional network named MS-Gait in this paper. Concretely, we integrate Squeeze-and-Excitation (SE) blocks, which learn to use global information to selectively enhance informative features and suppress less useful ones, into graph convolutional networks, which have shown great potential in skeleton-based gait recognition, to improve the quality of the representations produced by the model. Besides, we deploy two pathways to concurrently capture features from the whole skeleton graph as well as its subgraphs, which improves the performance significantly. Finally, in order to fuse the bone and motion information of the skeleton data, we compute bones as vectors pointing from the source joint to the target joint and regard motion as the difference of coordinates along the temporal dimension. A multi-stream strategy is designed to integrate all of this information.
The main contributions of this paper are summarised as follows: (1) We integrate a channel attention learning mechanism into the graph convolutional networks by using Squeeze-and-Excitation blocks, which improves the representational ability of the GCN for the gait recognition task. (2) We propose a part-fused framework to simultaneously capture multi-granularity features from the skeleton graph and its subgraphs. (3) A multi-stream feature extraction strategy is designed to jointly model the joint, bone, and motion information, which is proven to effectively improve the recognition accuracy.

Gait Recognition
The prior works can be divided into two categories. On the one hand, appearance-based methods (Han & Bhanu, 2005; Yu, Chen, Garcia Reyes, et al., 2017; Yu, Chen, Wang, et al., 2017) directly extract features from human silhouettes for recognition. The gait energy image (GEI) (Han & Bhanu, 2005), defined as the weighted average of the silhouette images over a gait cycle, is the most popular gait template due to its low computational cost and relatively high recognition rate. Nevertheless, such averaging of silhouettes loses temporal information to a certain extent. Some recent works attempt to learn more discriminative features from silhouette sequences. For instance, Chao et al. (2019) regard gait as a set of independent frames, exhibiting significant robustness in various complex scenarios. Although these methods handle cross-view tasks well, the silhouette shapes they take as input are greatly affected by clothing and carrying conditions. Hence, gait recognition under cross-carrying and cross-wearing conditions remains a great challenge.
On the other hand, model-based methods obtain feature representations by modelling the human body structure. With accurate modelling, this type of method can effectively overcome the interference caused by occlusion and viewing angle, since joint positions generally do not change with varying covariate conditions. Some earlier methods mark different body parts manually (Tanawongsuwan & Bobick, 2001) or apply specific devices to locate and track the joints (Andersson & Araujo, 2015; Kastaniotis et al., 2016; Yang et al., 2016). Recently, as research on pose estimation has made great progress, some works try to exploit it for gait recognition. Feng et al. (2016) extract heatmaps with a CNN based on a pose estimation method to describe the gait in each frame and then adopt an LSTM to model the temporal information. Similarly, Liao et al. (2017) combine an LSTM and a CNN to capture the dynamic and static features of the input gait sequence. In addition, some works such as Liao et al. (2020) exploit the human 3D pose estimated from images as the input for gait recognition.
Recently, graph convolutional networks have extended conventional convolution (Lu, 2021; Srivastava & Biswas, 2020) to graph-structured data and have been proven applicable in many computer vision tasks, including image classification (S. Zhang et al., 2021), semi-supervised learning (Kipf & Welling, 2016), and action recognition (Yan et al., 2018). For gait recognition, Li et al. (2020) first applied GCNs to recognise gait from human 2D joints and obtained remarkable performance despite the low-dimensional input features. Shopon et al. (2021) integrate residual connections into the GCN to enhance performance, amplifying the high-level feature maps of deeper layers by adding the low-level feature maps. However, these methods take superfluous joints as input and do not exploit bone and motion data, which leaves room for further improvement.

Attention Mechanism
The core idea of the attention mechanism is to bias the allocation of available computational resources towards the most informative components; it has been successfully applied in several fields, such as text classification (S. Zhang et al., 2021), localisation and understanding in images (C. Cao et al., 2015), and time-series prediction (S. Wu et al., 2021). Squeeze-and-Excitation blocks (Hu et al., 2018) follow a soft-attention strategy to adaptively recalibrate channel-wise feature responses by explicitly modelling channel relationships. It has been demonstrated that they can be integrated with modern architectures to improve representational power. Given this utility, we leverage Squeeze-and-Excitation blocks to exploit the correlation and relative importance between channels and further enhance the learning ability of the proposed model.

Proposed method
In this section, we propose a novel skeleton-based gait recognition method, MS-Gait, which extracts multi-order discriminative features from human skeleton data. The overall pipeline of MS-Gait is illustrated in Figure 1. Taking human skeletons estimated from the RGB frames as input, spatial-temporal graph convolutional networks (without SE blocks for the Joint-stream and with SE blocks for the Bone-stream and Motion-stream) learn effective gait features under the guidance of the Cross-entropy loss function. We then deploy another pathway in which the graph is divided into two subgraphs to capture part-level spatio-temporal gait features. Finally, the features of the multiple streams (joint, bone, motion) are concatenated to obtain the fused score for recognition. We introduce the implementation details in the following parts.

Joints Selection.
Typically, the first step of skeleton-based gait recognition methods is to estimate and preprocess the raw skeleton data from the given videos.

Figure 1. (a) Skeleton data estimated from a video sequence is used to construct spatial-temporal graphs; the graphs and subgraphs are then fed into graph convolutional networks to extract both spatial and temporal features, and the Cross-entropy loss is used to optimise the network. (b) Illustration of the overall architecture of the multi-stream network: the joint, bone, and motion information are fed into the network simultaneously, and the feature embeddings of all streams are concatenated for final recognition.

We adopt OpenPose (Z. Cao et al., 2019) to estimate the skeleton data. According to its official documentation, the BODY_25 model, which can estimate 25 joints (Figure 2, mid), is much more accurate than the COCO model. Since the accuracy of pose estimation has a decisive influence on gait recognition, we choose the BODY_25 model of OpenPose to locate 25 joints of the human body, namely Nose, Neck, RShoulder, RElbow, RWrist, LShoulder, LElbow, LWrist, MidHip, RHip, RKnee, RAnkle, LHip, LKnee, LAnkle, REye, LEye, REar, LEar, LBigToe, LSmallToe, LHeel, RBigToe, RSmallToe and RHeel. Significantly, the state of some joints, such as the eyes and ears, changes minimally during walking. These joints not only fail to boost performance, but can even reduce the accuracy of gait recognition. With this in mind, we only keep the first 15 joints (Figure 2, right), which already contain a wealth of gait information.
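For illustration, the joint selection step can be sketched as follows (Python/NumPy; the file name and variable names are ours, and the joint order follows the BODY_25 layout listed above):

```python
import numpy as np

# Hypothetical OpenPose BODY_25 output for one sequence:
# shape (T, 25, 3) = (frames, joints, [x, y, confidence]).
raw = np.load("sequence_body25.npy")

# Keep only the first 15 joints (Nose ... LAnkle) and drop the
# confidence channel, leaving a (T, 15, 2) array of coordinates.
NUM_JOINTS = 15
joints = raw[:, :NUM_JOINTS, :2]
```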

Graph Construction.
For gait recognition, the raw skeleton data obtained by pose estimation algorithms from videos is a sequence of joint coordinates. Following ST-GCN (Yan et al., 2018), we construct a spatial-temporal graph, where nodes represent human joints and edges represent natural connections along both the spatial and temporal dimensions. As shown in Figure 3 (left), each node carries the location of the corresponding joint, whose coordinates (x, y) have been normalised to the range [0, 1] (the blue vertexes). In the spatial dimension, two joints with a physical dependency within one frame are connected (the blue lines in Figure 3, left). In the temporal dimension, the same joint in consecutive frames is connected (the green lines in Figure 3, left).
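As a minimal sketch of this construction (Python/NumPy), the spatial adjacency matrix over the 15 retained joints could be built as below; the edge list reflects our reading of the natural connections in Figure 3 and should be checked against the original implementation:

```python
import numpy as np

NUM_JOINTS = 15
# (child, parent) pairs over the 15 retained BODY_25 joints:
# 0 Nose, 1 Neck, 2 RShoulder, 3 RElbow, 4 RWrist, 5 LShoulder,
# 6 LElbow, 7 LWrist, 8 MidHip, 9 RHip, 10 RKnee, 11 RAnkle,
# 12 LHip, 13 LKnee, 14 LAnkle.
EDGES = [(0, 1), (2, 1), (3, 2), (4, 3), (5, 1), (6, 5), (7, 6),
         (8, 1), (9, 8), (10, 9), (11, 10), (12, 8), (13, 12), (14, 13)]

def build_adjacency(num_joints=NUM_JOINTS, edges=EDGES):
    """Symmetric spatial adjacency matrix with self-loops (A[i, i] = 1)."""
    A = np.eye(num_joints, dtype=np.float32)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A
```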

Spatial Graph Convolution.
In ST-GCN (Yan et al., 2018), spatial graph convolution and temporal convolution are performed alternately. Among them, the spatial graph convolution is the core component, which computes a weighted average of the adjacent features for each joint. The graph convolution on vertex v_i in the spatial dimension is

f_{out}(v_i) = \sum_{v_j \in B(v_i)} \frac{1}{Z_i(v_j)} f_{in}(v_j) \, w(l_i(v_j)),

where v denotes a vertex, f denotes the feature map, and B(v_i) is the sampled neighbour set of v_i. The weighting function w provides a weight vector based on the given input. The mapping function l_i maps each neighbouring vertex to the index of its subset, so that all vertexes in the same subset share a weight vector. Z_i(v_j) denotes the cardinality of the corresponding subset, which aims to balance the contributions of each subset to the output. In this work, the feature map is defined as a tensor of dimensions (C, V, T), where C, V and T denote the number of channels, joints and frames, respectively. Let A_k \in \{0, 1\}^{N \times N} be the k-adjacency matrix of the joint graph.
Notably, A_{i,i} = 1 for all vertexes. The implementation of the graph convolution in the spatial dimension is formulated as

f_{out} = \sigma\Big( \sum_{k} \Lambda_k^{-\frac{1}{2}} \big( A_k \odot M_k \big) \Lambda_k^{-\frac{1}{2}} f_{in} W_k \Big),

where M_k is a learnable weight matrix that applies minor edge corrections to the skeleton graph, \Lambda_k is the normalised diagonal degree matrix with \Lambda_k^{ii} = \sum_{j} A_k^{ij}, W_k denotes the weight matrix that contains the weight vectors of multiple output channels, and \sigma(\cdot) is an activation function.
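A minimal PyTorch sketch of this operation for a single partition subset is given below (the module and variable names are ours; the learnable mask M follows the edge-correction idea described above, and the exact normalisation may differ from the released implementation):

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """One subset of the spatial graph convolution:
    f_out = sigma(Lambda^-1/2 (A * M) Lambda^-1/2 f_in W)."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.register_buffer("A", torch.as_tensor(A, dtype=torch.float32))
        self.M = nn.Parameter(torch.ones_like(self.A))        # learnable edge mask
        self.W = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):                                     # x: (N, C, T, V)
        A = self.A * self.M                                   # edge correction
        deg = A.sum(dim=1).clamp(min=1e-6)
        d_inv_sqrt = deg.pow(-0.5)
        A_norm = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)
        x = torch.einsum("nctv,vw->nctw", x, A_norm)          # neighbour aggregation
        return torch.relu(self.W(x))                          # per-joint linear map + activation
```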

Temporal Convolution.
The implementation of the temporal convolution is relatively simple. We directly perform a standard 2D convolution with a 1 × K kernel along the temporal dimension to learn temporal information. Following the G3D design described in the next subsection, we adopt a bottleneck architecture with a fixed kernel size and different dilation rates to achieve multi-scale learning with larger receptive fields.
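A sketch of one such temporal branch is shown below (PyTorch; the module name and default hyper-parameters are ours):

```python
import torch.nn as nn

class TemporalConv(nn.Module):
    """2D convolution over the temporal axis of a (N, C, T, V) feature map,
    i.e. a K x 1 kernel with a configurable dilation rate."""
    def __init__(self, channels, kernel_size=3, dilation=1, stride=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2   # keep T unchanged when stride is 1
        self.conv = nn.Conv2d(channels, channels,
                              kernel_size=(kernel_size, 1),
                              stride=(stride, 1),
                              padding=(pad, 0),
                              dilation=(dilation, 1))
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):                          # x: (N, C, T, V)
        return self.bn(self.conv(x))
```

In the bottleneck variant, several such branches with different dilation rates are run in parallel and their outputs are combined to enlarge the temporal receptive field.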

Unified Spatial-Temporal Modelling.
In addition to using such factorised modules to capture spatial and temporal features separately, we follow the design named G3D, a unified spatial-temporal graph convolutional operator, to learn complex spatial-temporal joint relations. A sliding temporal window of size \tau with dilation rate d is used to construct a spatial-temporal graph with \tau N nodes, and the feature map expands (with zero padding) to a tensor of dimensions (C, \tau V, T). The windowed k-adjacency matrix can be obtained by tiling A_k:

A_{(\tau), k} = \begin{bmatrix} A_k & \cdots & A_k \\ \vdots & \ddots & \vdots \\ A_k & \cdots & A_k \end{bmatrix} \in \{0, 1\}^{\tau N \times \tau N}.

By coupling the above definitions, the unified spatial-temporal convolution is derived as

f_{out} = \sigma\Big( \sum_{k} \Lambda_{(\tau), k}^{-\frac{1}{2}} A_{(\tau), k} \Lambda_{(\tau), k}^{-\frac{1}{2}} f_{in} W_k \Big),

where \Lambda_{(\tau), k} is the corresponding normalised diagonal degree matrix. Finally, the G3D pathway and the factorised pathway are combined into the STGC block, whose architecture is shown in Figure 4.
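The adjacency tiling can be sketched in a few lines (NumPy; the function name is ours):

```python
import numpy as np

def tile_adjacency(A_k, tau):
    """Tile an N x N k-adjacency matrix into a (tau*N) x (tau*N) matrix so
    that every joint in the temporal window is connected to its k-hop
    spatial neighbours in all tau frames."""
    return np.tile(A_k, (tau, tau))
```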

Squeeze-and-Excitation Block.
The spatial-temporal graph convolutional network aggregates spatial and temporal features over a local receptive field. To further improve the performance of the model, an attention mechanism is introduced in this section. We leverage the correlation between channels and learn a weight for each channel, so as to selectively enhance useful features and suppress secondary ones. Specifically, the SE block (Hu et al., 2018) is added on top of the STGC block to explicitly model the interdependence between feature channels, which involves three steps. The first is the squeeze step, which exploits global average pooling (GAP) to compress the global spatial features into a channel descriptor. After squeezing, the output represents the global distribution of responses over the feature channels, so that a global receptive field is obtained at the input. The second is the excitation step. Similar to the gate mechanism in recurrent neural networks (RNNs), this operator generates input-specific weights for each channel descriptor to learn a nonlinear interaction between channels. The third is the reweight step. Since the output weights reflect the importance of each channel, a Scale layer completes the recalibration of the original features in the channel dimension through channel-wise multiplication between the weight vector and the feature map. The concrete structure of the STGC block combined with the SE block is depicted in Figure 5.
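For reference, a minimal sketch of the squeeze, excitation and reweight steps on a (N, C, T, V) feature map is shown below (PyTorch; the module name and reduction-ratio handling are ours, following Hu et al. (2018)):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)      # squeeze: global average pooling
        self.excite = nn.Sequential(                # excitation: channel gating
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (N, C, T, V)
        n, c, _, _ = x.shape
        s = self.squeeze(x).view(n, c)              # channel descriptor
        w = self.excite(s).view(n, c, 1, 1)         # per-channel weights
        return x * w                                # reweight (Scale layer)
```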
Through extensive experiments, we find that SE-STGC blocks bring significant improvements for the Bone-stream and Motion-stream, while the original STGC blocks perform much better for the Joint-stream. The reason lies in that, compared with the joint data, the bone and motion data contain more information, since we replenish empty frames with repeated sequences for the bone and motion data but with zero padding for the joint data. When the attention mechanism tries to learn to deal with this deviation, it reduces the accuracy in some cases to a certain extent. Thus, in order to obtain prominent accuracy in various scenarios, we combine the advantages of STGC and SE-STGC, performing STGC in the Joint-stream and SE-STGC in the Bone-stream and Motion-stream.

Loss Function.
For a given query gait sequence and a set of candidate sequences, the goal of the network is to compute the similarity between the query and each candidate and to use it to rank the candidates, so that the correct gait sequence is retrieved at the top ranks. To this end, we use the Cross-entropy loss to optimise our network to extract discriminative spatio-temporal gait features. The Cross-entropy loss is typically used in classification tasks (Yan et al., 2018; Ying et al., 2021) and effectively pulls apart the embeddings of different categories by guiding the network to classify the input into the correct class. It is defined as

\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} y_i \log p_i,

where n indicates the total number of samples, y_i indicates the ground-truth label, and p_i indicates the predicted output.

Part-fused GCN
Current GCN-based gait recognition methods regard the entire human body as a unit and therefore capture walking patterns from the entire skeleton graph. Actually, walking can be considered the co-movement of body parts, namely the arms and legs, which are the smallest units of the walking action. This means that the movement of body parts contains discriminative information that can be exploited for gait recognition. However, it is hard for a GCN to automatically capture such detailed structure from skeleton data. For this reason, we focus on the modelling of body parts and propose a part-fused framework to explore the importance of each body part and its interactions across the spatial and temporal domains.
In particular, we divide the skeleton graph into two subgraphs, namely the upper limbs and the lower limbs (Figure 3, right), to learn the attributes of each body part. As shown in Figure 1(a), we first perform graph convolution over the subgraphs to capture high-level properties of the body parts. During this process, information is propagated within the upper and lower extremities separately. Then we employ global average pooling over all vertexes to aggregate the above results. Finally, to fuse part-level features, we concatenate the embedding captured from the entire graph and the embeddings captured from the subgraphs as the final feature for gait recognition.
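A schematic sketch of this part-fused aggregation is given below (PyTorch-style; gcn_full, gcn_upper and gcn_lower stand for the graph-convolution pathways over the full graph and the two subgraphs, and the joint index lists are our reading of Figure 3):

```python
import torch

# Joint indices of the two subgraphs among the 15 retained joints.
UPPER = [0, 1, 2, 3, 4, 5, 6, 7]        # nose, neck, shoulders, elbows, wrists
LOWER = [8, 9, 10, 11, 12, 13, 14]      # mid-hip, hips, knees, ankles

def part_fused_embedding(x, gcn_full, gcn_upper, gcn_lower):
    """x: (N, C, T, V) skeleton features; returns the fused embedding."""
    f_full = gcn_full(x).mean(dim=(2, 3))            # GAP over frames and joints
    f_up = gcn_upper(x[..., UPPER]).mean(dim=(2, 3))
    f_low = gcn_lower(x[..., LOWER]).mean(dim=(2, 3))
    return torch.cat([f_full, f_up, f_low], dim=1)   # concatenated part-fused feature
```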

Multi-stream networks
Existing skeleton-based gait recognition methods (Li et al., 2020; Liao et al., 2017, 2020) typically leverage the 2D or 3D spatial coordinates of joints but neglect the bone and motion information, which is equally vital for gait recognition. The bone data indicates the length and direction information between joints, and the motion data indicates the movement and speed information of joints between frames. To better exploit these data, a multi-stream feature extraction strategy is proposed in this section to promote the gait recognition task.
Since bones are the links between two joints, we define the bone data as a vector pointing from the source joint to the target joint. Specifically, given the source joint v_{i,t} = (x_{i,t}, y_{i,t}) and the target joint v_{j,t} = (x_{j,t}, y_{j,t}) in frame t, the bone is computed as

e_{i,j,t} = (x_{j,t} - x_{i,t}, \; y_{j,t} - y_{i,t}).

Note that each joint has a unique source joint, while the nose joint takes itself as its source. Figure 6 depicts the pointing relationship, where the start and end of each arrow denote the source and target joint of a bone.
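A minimal sketch of how the bone vectors (and the motion data introduced in the next paragraph) can be derived from the joint coordinates is given below (NumPy; the SOURCE list is only our reading of Figure 6 and should be matched to the original definition):

```python
import numpy as np

# joints: (T, 15, 2) normalised coordinates in the BODY_25 subset order.
# SOURCE[j] is the source joint of joint j; the Nose points from itself.
SOURCE = [0, 0, 1, 2, 3, 1, 5, 6, 1, 8, 9, 10, 8, 12, 13]

def bone_data(joints):
    """Bone vectors pointing from the source joint to the target joint."""
    return joints - joints[:, SOURCE, :]

def motion_data(joints):
    """Per-joint displacement between consecutive frames (last frame is zero)."""
    motion = np.zeros_like(joints)
    motion[:-1] = joints[1:] - joints[:-1]
    return motion
```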
Obviously, the motion information of a joint refers to the change of its coordinates between adjacent frames. Given the coordinates of joint i in frame t, v_{i,t} = (x_{i,t}, y_{i,t}), the motion data can be generated as

m_{i,t} = v_{i,t+1} - v_{i,t} = (x_{i,t+1} - x_{i,t}, \; y_{i,t+1} - y_{i,t}).

As shown in Figure 1(b), we use the Joint-stream, Bone-stream and Motion-stream to denote the networks that receive joint, bone and motion data as input, respectively. Given the skeleton data, the bone and motion data are first calculated, and the three kinds of data are fed into the Joint-stream, Bone-stream and Motion-stream simultaneously. The output feature maps of the three streams are then summed up to fuse all the information for recognition. The experiments in Section 4.2 show that joint, bone, and motion information are discriminative and complementary, and that combining them improves the recognition performance.

Dataset and experimental setting

CASIA-B.
We perform several empirical experiments on the popular gait database CASIA-B (Yu et al., 2006) to evaluate the performance of the proposed MS-Gait method. CASIA-B, containing RGB videos gathered from 124 subjects in total (31 females and 93 males), is one of the largest public gait datasets. For each subject, 10 gait sequences are provided, including 6 sequences under the normal walking (NM) condition, 2 sequences under the walking-with-bag (BG) condition and 2 sequences under the walking-in-coat (CL) condition. For each sequence, the walking videos were captured simultaneously from cameras at 11 different views, namely {0°, 18°, 36°, ..., 180°}. For evaluation, we follow the two popular experimental settings of GaitSet (Chao et al., 2019), namely medium-sample training (MT) and large-sample training (LT). In MT, the first 62 subjects form the training set and the remaining 62 subjects form the test set. In LT, the first 74 subjects form the training set and the remaining 50 subjects form the test set. In the test set, the gallery set consists of the first 4 sequences under the NM condition (NM01-NM04) of each subject, and the probe set retains the rest (NM05-NM06, BG01-BG02, CL01-CL02). During testing, all accuracies are averaged over the 11 gallery views, excluding the identical view. For instance, the accuracy for probe view 0° is averaged over the 10 gallery views other than 0°.

Experimental Setting.
The model is composed of 3 STGC blocks, whose output feature dimensions are 96, 192 and 384, respectively. The size of the input data is 2 × 120 × 15, i.e. 120 frames per clip, 15 joints per frame, and 2 channels (the horizontal and vertical coordinates) per joint. For gait sequences with fewer than 120 frames, we pad the joint data with empty (all-zero) frames and repeat the bone and motion sequences until they reach 120 frames. The furthest distance D of neighbouring nodes is set to 8, and the strides of the 2nd and 3rd blocks are 2. Two G3D pathways are deployed with (τ, d) ∈ {(3, 1), (5, 1)}. In the SE blocks, we set the reduction ratio r to 16. The batch size is 128. The SGD optimiser is adopted with an initial learning rate of 0.1, decayed by a factor of 0.1 after 45 and 55 epochs. The model is trained for about 65 epochs on one NVIDIA 2080Ti GPU.
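The sequence-length preprocessing described above can be sketched as follows (NumPy; pad_to_length is our own helper):

```python
import numpy as np

def pad_to_length(seq, length=120, mode="zero"):
    """Bring a (T, V, C) sequence to a fixed number of frames.

    mode="zero":   append empty (all-zero) frames, as used for the joint data.
    mode="repeat": repeat the sequence until it reaches the target length,
                   as used for the bone and motion data.
    """
    t = seq.shape[0]
    if t >= length:
        return seq[:length]
    if mode == "zero":
        pad = np.zeros((length - t,) + seq.shape[1:], dtype=seq.dtype)
        return np.concatenate([seq, pad], axis=0)
    reps = int(np.ceil(length / t))
    return np.tile(seq, (reps,) + (1,) * (seq.ndim - 1))[:length]
```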

Ablation study
In order to examine each component of the network, we conduct extensive experiments on the CASIA-B dataset in this section. Note that all the experiments in this section adopt large-sample training (LT).

Effectiveness of SE Block.
Firstly, we validate the effect of the SE block on recognition performance. In the experiments, we consider two scenarios for each stream, namely without SE and with SE. Table 1 shows the corresponding recognition accuracies. It can be seen that for the Joint-stream, the network with SE blocks performs better under the NM condition, while the original network performs better under the BG and CL conditions. The intuition is that, compared with the NM condition, the skeleton data extracted by pose estimation methods is more biased under the BG and CL conditions, which poses more challenges for gait recognition. When the attention mechanism tries to learn to improve the accuracy, it neglects this deviation to some extent. However, the networks with SE blocks obtain the highest recognition performance for the Bone-stream and Motion-stream, which validates the effectiveness of our strategy. To obtain high accuracy across scenarios, we adopt STGC for the Joint-stream and SE-STGC for the Bone-stream and Motion-stream in the subsequent experiments.

Part-fused and Multi-Stream.
Here, we verify the effectiveness of the proposed part-fused framework by comparing the accuracies of each stream with and without part fusion, as well as the effectiveness of the proposed multi-stream strategy by comparing the accuracies of single-stream and multi-stream models. The average recognition rates are shown in Table 2. We can clearly see that for all three streams, the performance improves in all conditions after fusing the part-level features. Also, the recognition accuracy of MS-Gait with part fusion is on average 2.3% higher than that of MS-Gait without it, which proves the effectiveness of the proposed part-fused framework. Additionally, the MS-Gait method that combines joint, bone and motion information outperforms the single-stream-based methods under all three conditions. Compared with the original architecture that only exploits the joint data, the accuracy improves significantly from 76.9% to 90.0% in NM, from 68.2% to 79.7% in BG, and from 64.4% to 79.4% in CL. These results indicate that the three types of information are discriminative and complementary, and that the multi-stream combination clearly enhances the discriminative ability of the gait features.

Comparison of Loss Functions.
We also conduct ablation experiments with different loss functions. Notably, for the proposed model optimised by minimising the triplet loss, the margin is set to 0.5, the batch size is set to (8, 16) and the model is trained for about 20K iterations. The results are shown in Table 2. Whether for single-stream or multi-stream, although training with the triplet loss takes twice as long as with the cross-entropy loss, the proposed model obtains lower accuracy under all three conditions. Thus, we finally adopt the cross-entropy loss for the proposed approach.

In Tables 3-5, the probe set contains gait sequences from the NM05-NM06, BG01-BG02, and CL01-CL02 conditions, respectively. In each table, the rows correspond to the 11 view angles of the gallery set, and the columns correspond to the 11 view angles of the probe set.

Comparisons with model-based methods
In this section, we compare the robustness of the proposed MS-Gait with that of three advanced model-based methods under the same experimental settings, namely PoseGait (Liao et al., 2020), JointsGait and GaitGraph (Teepe et al., 2021). These methods all work on 2D joints estimated by OpenPose (Z. Cao et al., 2019), and PoseGait (Liao et al., 2020) additionally transforms the 2D poses into 3D. The accuracies of the compared methods are cited directly from the original papers. The experimental results under the NM, BG and CL conditions are shown in Table 6. It can be clearly seen that MS-Gait not only achieves the best average accuracies, but also outperforms all the compared model-based methods on nearly every probe view. This clearly demonstrates that MS-Gait can effectively handle variations in viewing angle and walking condition.

Comparisons with appearance-based methods
Compared with appearance-based features, the features extracted by model-based methods are more compact, which makes model-based feature extraction more challenging. To demonstrate the effectiveness of model-based methods, we compare the average recognition rates under view variation with several advanced appearance-based methods, namely CNN-LB (Z. Wu et al., 2016), GaitSet (Chao et al., 2019), and GaitPart (Fan et al., 2020). All methods use the same experimental setting, with the first 74 subjects for training and the remaining 50 for testing, and consider the three walking conditions across viewing angles from 0° to 180°. The results are shown in Figure 7. Under the CL condition, the proposed method achieves a high average recognition rate of 79.4%, which outperforms all the appearance-based methods. This demonstrates the effectiveness of the skeleton-based method for walking in a coat, since the skeleton is much more robust than silhouette images. Under the NM and BG conditions, the recognition rate of the proposed method is much higher than that of the appearance-based method CNN-LB (Z. Wu et al., 2016) and is competitive with GaitSet (Chao et al., 2019) and GaitPart (Fan et al., 2020). The intuition is that we only exploit the skeleton data as the gait feature, while appearance-based methods exploit higher-dimensional features. However, we can see that the proposed MS-Gait has the smallest standard deviation; that is, unlike the volatile performance of the appearance-based methods, our model is more stable to changes of covariates, which is consistent with the hypothesis that model-based approaches have greater robustness and adaptability. Consequently, it is reasonable to expect that our method may perform even better with a larger training set, owing to the robustness of the human skeleton.

Conclusion
In this paper, we present a novel model-based gait recognition method, MS-Gait, to extract both spatial and temporal features from skeleton data. It integrates SE blocks into the STGC block to learn more representative and discriminative gait features, and fuses fine-grained information extracted from human body parts. Also, a multi-stream feature extraction strategy is proposed to aggregate joint, bone, and motion information and effectively promote recognition performance. Experiments on the popular CASIA-B dataset show that our method achieves the highest recognition rates compared with other state-of-the-art model-based algorithms and surpasses most appearance-based algorithms under view variation, which proves that MS-Gait can distil more discriminative information under various viewing angles and conditions. However, the proposed method still has certain limitations, including the inaccuracy of human body modelling, the lack of 3D coordinates, and the absence of distinct weights for each joint and each frame. In the future, addressing these limitations and combining gait features extracted from joints with those extracted from appearance deserve further study.

Disclosure statement
No potential conflict of interest was reported by the author(s).