A hybrid recognition model of microseismic signals for underground mining based on CNN and LSTM networks

Abstract Microseismic (MS) monitoring technology has been widely used to monitor ground pressure disasters. However, the underground mining environment is complex and contains many types of noise sources. Furthermore, traditional recognition methods involve a complex process and achieve low recognition accuracy for MS signals, which makes it difficult for them to support safe mine production. Therefore, this study established a hybrid model combining the singular spectrum analysis (SSA) method, convolutional neural networks (CNN), and long short-term memory networks (LSTM). First, the principal components of monitoring signals were extracted with the SSA method, and then the spatial and temporal features of monitoring signals were extracted with the CNN and LSTM, respectively. Based on actual field data collected from Xiadian Gold Mine, the hybrid model was compared with the CNN, LSTM, and back-propagation (BP) networks, as well as commonly used recognition methods including the support vector machine (SVM), decision tree (DT), K-nearest neighbor (KNN), and linear discriminant analysis (LDA). The results show that the proposed hybrid model can accurately extract data features of monitoring signals and further improve the recognition performance for MS signals. In particular, the hybrid model increases the recognition accuracy for mechanical signals, which helps avoid their confusion with MS signals.


Introduction
The deformation and failure of rock masses essentially mean the initiation, propagation, and interaction of fractures under engineering-induced disturbance, during which microseismic (MS) signals are released and can be synchronously recorded by multiple MS sensors distributed in the vicinity (Dai et al. 2017; Zhao et al. 2019). Performing inversion analysis on the MS signals containing rock failure information can rapidly acquire important data such as three factors (i.e., time, location, and fuzzy identification based on prior knowledge, and the self-adaptive ability of the identification model is poor. Therefore, it can only be applied to identifying MS signals under specific conditions with a low degree of intelligence. The above methods of MS signal recognition require features to be extracted manually from the original data based on signal processing and statistical theories. The disadvantage is that extracting signal features demands considerable engineering practice and information processing expertise and depends heavily on the professional knowledge of personnel. Furthermore, signals captured by the MS monitoring system in mines include MS signals, blasting signals, and some mechanical signals with low frequency and amplitude (which are easily confused with MS signals). Therefore, the recognition of MS signals is a multi-class recognition problem. Due to the complexity of the mine environment and the scattering and reflection effects of the rock, the energy and frequency of monitoring signals decrease continuously, and the signals are characterized by diversity and complexity (Xu et al. 2010). Moreover, the monitored signals are so numerous that traditional feature extraction and signal recognition methods are inapplicable to the MS monitoring system used in the mine, which makes the accurate identification of MS signals a challenging topic.
Although many scholars have combined data statistics, signal processing, and neural network technologies to realize preliminary computer-based identification of MS signals, the final identification result still relies on subsequent manual intervention. In the early automatic recognition models, the feature parameters had to be provided manually. Furthermore, the recognition criterion was single, which significantly affected the accuracy and reliability of recognition, and the models had no deep learning ability. In recent years, deep learning technology has provided an effective tool for solving the above problems.
At present, owing to its feature extraction ability, which is superior to that of traditional methods, deep learning technology has achieved great success in computer vision, machine translation, signal processing, and other fields (Lv et al. 2019; Singh et al. 2021). However, research on its use in mine MS monitoring is still in its infancy (Dong et al. 2020). The feature learning ability of the convolutional neural network (CNN) has been widely recognized, and the CNN has developed rapidly with the advances made possible by modern computers and big-data techniques (Gu et al. 2018). It has made breakthrough achievements in image recognition (Alex et al. 2012), speech recognition (Abdel-Hamid et al. 2014), and the analysis of heartbeat and brain-wave signals (Kiranyaz et al. 2016), while its application to MS monitoring in mines remains rare. However, information in the CNN is transferred unidirectionally, and there is no connection between the nodes within each layer, so the CNN cannot model time-dependent relationships. Furthermore, due to the complexity of MS monitoring signals, the features of the original time series are generally ignored when monitoring signals are processed using only the CNN. As a result, the CNN can extract only the local spatial features of the monitoring signal but not the time-domain features.
The long short-term memory (LSTM) network, a type of recurrent neural network (RNN), has a cyclic network structure and the ability to memorize historical information. That is, the LSTM can extract the time-domain characteristics of time-series signals. At present, the LSTM has been widely used for the analysis of time-series signals, such as those required for traffic speed prediction, power load forecasting (Jian et al. 2017; Kong et al. 2019), and speech recognition (Graves and Jaitly 2014). However, in the field of mine MS monitoring, little research has been done on the LSTM for MS signal recognition, so the advantages of the LSTM have not been effectively combined with the time-series features of MS signals. Moreover, the disadvantage of the LSTM is that it can extract only the temporal features of signals but not the local spatial features.
Monitoring signals have both spatial and temporal features, and it is hard to extract the two types of features simultaneously in a single network structure. Hence, a new hybrid network model (SSA-CNN-LSTM) is proposed for identifying all kinds of monitoring signals generated during mining, based on singular spectrum analysis (SSA) (for extracting valid signal components), CNN (for extracting spatial features), and LSTM (for extracting temporal features). Considering that the monitoring signal is usually the superposition of multiple signals, the principal component of the monitoring signal is extracted by SSA before its spatio-temporal features are extracted, thus enhancing the distinction between different types of monitoring signals. Then, combining the advantages of CNN and LSTM, the former is used to extract the local spatial characteristics of the signal, and the latter is used to extract its temporal characteristics. Based on the experimental data of MS monitoring signals recorded in a test stope in the Xiadian Gold Mine (Shandong Province, China), the accuracy of the SSA-CNN-LSTM at recognizing signals was verified. The study aims to improve the recognition performance for MS signals, providing a new approach to recognizing and studying MS monitoring signals in mines.

Engineering background
The Xiadian Gold Mine, located in Zhaoyuan City, the 'gold capital' of China, is one of ten mines producing gold in China. The vein-like and quasi-lamellar main orebody of the mine occurs at elevations in the range of -600 to -1470 m and undulates gently along the strike and dip (left panel, Figure 1). The Xiadian Gold Mine, with a current mining depth of 800 m, typifies deep mining operations in China. The underground mining environment in the mining area is adverse, especially under the long-term effect of high in situ stress. Much elastic energy accumulates in the surrounding rocks, which poses a risk of severe ground pressure behavior. At present, to achieve the economic and safety goals of efficient mining, a test stope (lower right panel, Figure 1) has been set at the -700 m sub-level in the mine to perform optimization testing of the stope structure. In addition, the IMS MS monitoring system commonly used in mines was installed near the test stope to monitor its stability. Based on the prevailing conditions on-site, six MS sensors (five single-component geophones and one three-component geophone with a bandwidth of 9 to 2000 Hz and a sensitivity of 80 V/m/s ± 10%) with a 6000 Hz sampling frequency were distributed near the sub-levels of -692 m and -700 m in the test stope, guaranteeing favorable location accuracy of MS events. Four MS sensors were installed at the -692 m sub-level and two at the -700 m sub-level, as shown in the right panel in Figure 1. The MS sensors were densely arranged in the range of 4,109,813.06 m to 4,109,931.55 m in the north and 529,676.57 m to 529,753.62 m in the east.

Data source and its characteristics
According to the blasting times and locations recorded on-site, blasting signals were manually selected as templates, against which all monitored signals were compared to identify the blasting signals manually. Next, the signals recorded when no blasting took place in the mine and few mechanical signals were present were examined, and the MS signals were screened out manually. These signals were used as references for manually identifying the MS signals among all monitored signals. Finally, the signals monitored during the operation of mechanical equipment were analyzed, and on this basis all monitored signals were examined manually to identify the mechanical signals. As a result, 7164 signals were collected from the MS monitoring system (from December 11, 2015, to February 29, 2016) before, during, and after mining in Xiadian Gold Mine, including 2765 MS signals, 2381 blasting signals, and 2018 mechanical signals. Hensman and Masko (2015) pointed out that a balanced class distribution yielded the best performance. Therefore, a signal database containing the three types of signals was established manually by carefully analyzing the monitoring signals. This signal database contains 2000 MS signals, 2000 blasting signals, and 2000 mechanical signals. The three common types of signals recorded in the MS monitoring system are shown in Figure 2. Figure 3 shows the statistical results of dominant frequencies, number of peaks, and maximum amplitude of the three common types of signals.
As shown in Figures 2 and 3, the three types of signals have the following characteristics:
1. A typical MS signal is shown in Figure 2(a). This kind of signal is released by rock failure, showing slow signal attenuation and a developed coda. The amplitude is lower than that of blasting signals and mechanical signals, and the dominant frequency is low, generally below 200 Hz. In addition, this kind of MS signal usually contains one peak.
2. A typical blasting signal is shown in Figure 2(b). Since metal mines are mostly excavated and mined through blasting, many blasting signals are recorded in monitoring signals, with higher amplitude and a wider dominant frequency range than MS signals. In addition, they have the characteristics of high amplitude, fast attenuation, and multiple peaks.
3. A typical mechanical signal is shown in Figure 2(c). Many types of mechanical equipment operate during mining in the underground mine, such as fans and drilling rigs, mine car transportation, truck transportation, and ore-drawing. Such signals have a longer duration, slower attenuation, and a wider dominant frequency range than MS signals. However, their developed coda is similar to that of MS signals. Moreover, because underground mechanical equipment produces many mechanical signals that are complex and diverse, it is difficult to distinguish them from MS signals.
There are many similarities among the above three types of signals (Figure 3), so it is challenging to recognize MS signals from amplitude or dominant frequency alone. In addition, due to the complex transmission paths of MS signals, phenomena such as reflection, diffraction, and attenuation occur during the transmission of signals in geologic media. Moreover, signals also interfere with each other, which adds to the complexity of the signals and increases the difficulty of signal identification.

A new signal recognition model: SSA-CNN-LSTM
In this section, a new hybrid signal recognition model (SSA-CNN-LSTM), which is the core content of this article, is proposed by combining the SSA method, CNN, and LSTM. The SSA method is selected because it can extract valid components of monitoring signals and weaken background noise, thus improving the discrimination of the signals (Zhigljavsky 2010). The monitoring signal processed by the SSA method is converted into an RGB image through the short-time Fourier transform (STFT) method, and this image is used as the input of the proposed hybrid model. The CNN is selected because it has a strong learning ability for image features and can therefore extract the spatial features of monitoring signals. The LSTM is selected because it can memorize historical information and is thus well suited to processing sequence signals.

SSA method
Using the SSA method, the data space is projected into sub-spaces with different features, and singular values characterize the nature of these sub-spaces. This method realizes the function of extracting principal components of signals based on the reduced-rank principle (Vautard and Ghil 1989) and can recognize different frequencies of signals and rank the signals by signal energy. It mainly includes three parts: calculation of trajectory matrix, singular value decomposition of the matrix, and reconstruction of principal components of signals.
The deduction behind the method is described as follows (Vautard and Ghil 1989; Groth and Ghil 2015): A one-dimensional time-series monitoring signal is assumed to be $x = [x_1, x_2, \ldots, x_N]$. To understand its implicit construct of temporal evolution, the monitoring signals are arranged with a time lag:

$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_{N-M+1} \\ x_2 & x_3 & \cdots & x_{N-M+2} \\ \vdots & \vdots & \ddots & \vdots \\ x_M & x_{M+1} & \cdots & x_N \end{bmatrix} \tag{1}$$

where the time lag $M$ refers to the window length or embedding dimension. The embedding dimension $M$ is equivalent to the resolution when decomposing components of monitoring signals. The above matrix $X$ is named the trajectory matrix. The trajectory matrix is a Hankel matrix, in which all elements along each anti-diagonal are equal. Its autocovariance matrix is expressed as follows:

$$T_X = \begin{bmatrix} c(0) & c(1) & \cdots & c(M-1) \\ c(1) & c(0) & \cdots & c(M-2) \\ \vdots & \vdots & \ddots & \vdots \\ c(M-1) & c(M-2) & \cdots & c(0) \end{bmatrix} \tag{2}$$

where $T_X$ represents a Toeplitz matrix, in which the elements on the main diagonal are equal, as are the elements on each line parallel to the main diagonal. It is widely applied in digital signal processing. The reason for constructing the covariance matrix is that it can judge the correlation of data. The larger the singular value of the covariance matrix, the closer the original series is distributed to the eigenvector corresponding to the singular value. Therefore, the waveform reconstructed by the eigenvector can represent the main features of the original series. $c(j)$ denotes the lag-$j$ covariance estimate of the time-series signal $x$, which can be calculated by Yule-Walker estimation:

$$c(j) = \frac{1}{N-j} \sum_{t=1}^{N-j} x_t x_{t+j}, \quad j = 0, 1, 2, \ldots, M-1$$

By performing singular value decomposition on $T_X$, the singular value vector $\lambda$ and the corresponding eigenvector matrix $E$ can be obtained, expressed as follows:

$$T_X E_k = \lambda_k E_k, \quad k = 1, 2, \ldots, M$$

where the singular values satisfy $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M \ge 0$. The singular value vector $\sigma$ of $T_X$ is given by:

$$\sigma = \left[ \sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_M} \right]$$

where $\sigma$ is called the singular spectrum of the MS signal $x$, and singular spectrum analysis is to calculate its singular values. The eigenvector corresponding to $\lambda_k$ is the $k$th-order mode, and each mode represents a different change trend of the signal. The larger the singular value, the higher the signal component amplitude and the greater the energy therein; a smaller singular value corresponds to the noise component in the signal. The eigenvector $E_k$ corresponding to $\lambda_k$ is known as the time empirical orthogonal function (TEOF), and the $k$th time principal component (TPC) is defined as the orthogonal projection coefficient of the original signal series $x_i$ on $E_k$. Through matrix operations on the trajectory matrix $X$ and the eigenvector matrix $E$, the weight matrix $A$ of the eigenvector matrix $E$ in matrix $X$ can be calculated as follows:

$$a_{i,k} = \sum_{j=1}^{M} x_{i+j} E_{j,k}, \quad 0 \le i \le N-M$$

where a series formed by any $M$ components of the TEOF reflects the temporal evolution of the signal series $x_i$; $a_{i,k}$ represents the $k$th TPC, which is the weight on the time represented by $E_{j,k}$ in the period of the original series $x_{i+1}, x_{i+2}, \ldots, x_{i+M}$.
An important function of the SSA method is the reconstruction component (RC) of signals. The so-called reconstruction rebuilds the original signal as required through matrix operations on the weight matrix $A$ and the eigenvector matrix $E$. In other words, $M$ series $\hat{x}_k$ ($1 \le k \le M$) containing different components of the monitoring signal $x$ can be reconstructed as required based on the TPCs and TEOFs, which realizes the function of extracting the principal components of the original signals. The $i$th sample of the $k$th RC is calculated as follows:

$$\hat{x}_{k,i} = \begin{cases} \dfrac{1}{i} \displaystyle\sum_{j=1}^{i} a_{i-j,k} E_{j,k}, & 1 \le i \le M-1 \\ \dfrac{1}{M} \displaystyle\sum_{j=1}^{M} a_{i-j,k} E_{j,k}, & M \le i \le N-M+1 \\ \dfrac{1}{N-i+1} \displaystyle\sum_{j=i-N+M}^{M} a_{i-j,k} E_{j,k}, & N-M+2 \le i \le N \end{cases}$$

RCs have the property of superposition, and the original signal series $x = \hat{x}_1 + \hat{x}_2 + \cdots + \hat{x}_M$ can be obtained by summing all RCs. As required, the first $K$ RCs with a large contribution are retained, so the approximation $\hat{x}$ of the original signal can be expressed as follows:

$$\hat{x} = \sum_{k=1}^{K} \hat{x}_k$$

Through the SSA, the effective components of the signal are aggregated into the first $K$ RCs. The reliable information is extracted as far as possible from monitoring signals containing noise, thus reducing the influence of noise.
To measure the energy of RCs in the original MS signal, the energy contribution of the $K$th RC is defined as:

$$\eta_K = \frac{\sum_{i=1}^{N} x_{RC_K,i}^2}{\sum_{i=1}^{N} x_i^2}$$

where $x_{RC_K,i}^2$ represents the square of the amplitude of the $i$th sampling point of RC $K$, and $x_i^2$ represents the square of the amplitude of the $i$th sampling point in the original MS waveform.
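The SSA steps described above can be sketched in a few lines of plain Python. This is a minimal, dependency-free illustration rather than the authors' implementation. The function names, the 1/(N-j) normalization of the lagged covariance, and the use of power iteration in place of a full singular value decomposition are all assumptions made for the sketch. Power iteration only yields the leading eigenvector, so the sketch is limited to RC 1.

```python
import math
import random


def lagged_autocovariance(x, M):
    """c(j) = 1/(N-j) * sum_t x[t] * x[t+j] for j = 0..M-1 (assumed normalization)."""
    N = len(x)
    return [sum(x[t] * x[t + j] for t in range(N - j)) / (N - j) for j in range(M)]


def leading_eigenvector(T, iters=500, seed=0):
    """Power iteration (a stand-in for a full decomposition) for the dominant eigenvector."""
    rng = random.Random(seed)
    M = len(T)
    v = [rng.uniform(0.1, 1.0) for _ in range(M)]
    for _ in range(iters):
        w = [sum(T[i][j] * v[j] for j in range(M)) for i in range(M)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm == 0.0:  # all-zero signal: nothing to decompose
            break
        v = [wi / norm for wi in w]
    return v


def first_reconstructed_component(x, M):
    """Return RC 1 of x for window length M (requires M <= len(x))."""
    N = len(x)
    c = lagged_autocovariance(x, M)
    T = [[c[abs(i - j)] for j in range(M)] for i in range(M)]  # Toeplitz matrix T_X
    E = leading_eigenvector(T)  # first TEOF
    # First TPC: projection of each length-M window of x onto E.
    a = [sum(x[i + j] * E[j] for j in range(M)) for i in range(N - M + 1)]
    rc = []
    for n in range(N):
        # Average a[i] * E[j] over every window position (i, j) with i + j == n.
        js = [j for j in range(M) if 0 <= n - j <= N - M]
        rc.append(sum(a[n - j] * E[j] for j in js) / len(js))
    return rc


def energy_contribution(rc, x):
    """Ratio of the summed squared amplitudes of an RC to that of the original signal."""
    return sum(r * r for r in rc) / sum(v * v for v in x)


if __name__ == "__main__":
    rng = random.Random(1)
    signal = [math.sin(2 * math.pi * 0.02 * n) + 0.1 * rng.gauss(0, 1) for n in range(200)]
    rc1 = first_reconstructed_component(signal, M=20)
    print(round(energy_contribution(rc1, signal), 3))
```

In practice a numerical library would normally replace the hand-written loops and return all M eigenvectors at once. The averaging loop covers the three index ranges of the RC formula in a single expression, because the number of valid window positions per sample shrinks automatically near both ends of the signal.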

CNN model
The commonly used deep learning neural networks can be roughly divided into feed-forward neural networks (FNN) and RNN. The CNN is one of the most representative FNN models. A CNN is a machine learning architecture with multiple hidden layers. The features of signals are transformed layer by layer into increasingly abstract, high-level features, which is more conducive to recognition or prediction (Bengio 2009). Furthermore, each node in a CNN is locally connected, and the connection weights of some neurons in the same layer are shared, so fewer weights are needed. This significantly reduces the number of parameters required for training (Sokolic et al. 2017) and brings substantial benefits when processing big-data problems, which makes the CNN suitable for recognizing MS signals in a mine with an inherently complex environment.
The structure and training process of a CNN are briefly described as follows. The typical network structure of a CNN is mainly composed of an input layer, convolution layers, pooling layers, fully connected layers, and a recognition layer (Figure 4). The convolution and pooling layers are adopted to extract features, while the fully connected and recognition layers are used for final recognition.

Convolution layer
A convolution layer usually contains several feature planes, each composed of multiple neurons. Each neuron is obtained by convolving a given convolution kernel, which moves with a certain stride, with a local region of the feature plane in the upper layer (called the local receptive field). The convolution kernel is a weight matrix, usually a 3 × 3 or 5 × 5 matrix for two-dimensional (2D) images (Gao et al. 2016). These convolution kernels act on the input images to extract local features therein. The convolution operation is shown in Figure 5. Different convolution kernels can extract different features, such as edges, lines, and corners of images, and the higher the level of the convolution layer, the higher the level of the features extracted (Hinton 2010; Dahl et al. 2013). A new feature map is generated by passing the convolution result of the feature maps in the upper layer through the activation function.
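The sliding-kernel operation described above can be illustrated with a small sketch. The function names are hypothetical, no padding is applied, and ReLU is assumed as the activation function because the text does not name one. As in most CNN frameworks, the kernel is applied without flipping (strictly speaking, a cross-correlation).

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of lists) without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            # Weighted sum over the local receptive field.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def relu(feature_map):
    """Assumed activation: keep positive responses, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# A 3 x 3 vertical-edge kernel applied to a 4 x 4 image with one vertical edge.
image = [[0, 0, 1, 1] for _ in range(4)]
kernel = [[-1, 0, 1] for _ in range(3)]
feature_map = relu(conv2d_valid(image, kernel))
```

Each output value depends only on a 3 × 3 neighborhood, and the same nine weights are reused at every position, which is the local connectivity and weight sharing mentioned earlier.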

Pooling layer
The pooling layer follows the convolution layer, and its function is to reduce the dimension of the feature maps while retaining useful information, thereby improving the calculation speed of the network. The combination of a convolution layer and a pooling layer is called a feature extraction process. The most common pooling operations are average pooling and maximum pooling. After pooling, the resolution of the feature map decreases, but the features described by the high-resolution feature maps are maintained (Simard et al. 2003).
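A minimal sketch of the two pooling operations mentioned above is given below. The function name, the 2 × 2 window, and the stride of 2 are illustrative assumptions, not values taken from the model in this paper.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Reduce a 2D feature map by taking the max or mean of each window."""
    out = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out


fm = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pooled_max = pool2d(fm, mode="max")   # 4 x 4 map reduced to 2 x 2
pooled_avg = pool2d(fm, mode="avg")
```

With these settings the number of values per feature map drops by a factor of four, which is where the gain in calculation speed comes from.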

Fully connected layer
After multiple convolution layers and pooling layers, one or more fully connected layers are established. The purpose of the fully connected layer is to integrate the highly abstract features obtained from the convolution layer or pooling layer and then normalize them. Each neuron in the fully connected layer is connected to all neurons in the upper layer. In the pooling layer, the feature maps are 2D arrays, which need to be converted into 1D vectors, and these 1D vectors are then concatenated to form the input of the fully connected layer. Thus, the mathematical expression of a single neuron in the fully connected layer is given by:

$$Q = f(W^T X + b) \tag{13}$$

where $Q$ represents the output of the fully connected layer; $f$ and $b$ indicate the activation function and bias value, respectively; $W^T$ denotes the weight to be trained; $X$ denotes the feature vector of the upper layer (pooling layer).
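The flattening step and the neuron expression above can be sketched as follows. The helper names are hypothetical, the weights are stored with one row per input feature so that the product corresponds to W^T X, and the activation passed in is an arbitrary choice for the example.

```python
def flatten(feature_maps):
    """Concatenate a list of 2D feature maps into a single 1D vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def fully_connected(x, W, b, f):
    """Compute f(W^T x + b); W has len(x) rows and len(b) columns."""
    return [
        f(sum(W[i][k] * x[i] for i in range(len(x))) + b[k])
        for k in range(len(b))
    ]


x = flatten([[[1.0, 2.0], [3.0, 4.0]]])          # one 2 x 2 map -> 4 values
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
b = [0.5, -10.0]
Q = fully_connected(x, W, b, f=lambda z: max(0.0, z))
```

Every output neuron sees all four inputs, in contrast to the local receptive fields of the convolution layer.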

Regression layer
The output value of the last fully connected layer is transferred to the regression layer to realize recognition or regression calculation. The implementation process is shown in Figure 6. The part to the left of the equals sign in the figure is the work of the fully connected layer, whose mathematical meaning is described in Eq. (13). The matrix W is a parameter of the fully connected layer, an M × N matrix (where M is the number of classes). When training the network, the fully connected layer is used to find the most appropriate weight matrix W. As described in Eq. (13), the output of the fully connected layer is an M × 1 vector, i.e., Logits in Figure 6. For the multi-class recognition problem, the fully connected layer is followed by a Softmax layer. The Softmax layer works as follows: for an input vector x, the conditional probability that the vector belongs to the jth class is first calculated and then normalized so that the output probabilities lie in the range from 0 to 1 and sum to 1, that is, Prob in Figure 6. Each value of this vector represents the probability that the input sample belongs to the corresponding class. The class with the highest probability is taken as the recognition result.
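The Softmax step can be written compactly, as in the sketch below. The function names and the example logits are made up for illustration. Subtracting the largest logit before exponentiating is a common numerical safeguard and does not change the result.

```python
import math


def softmax(logits):
    """Map a vector of logits to probabilities that sum to 1."""
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, labels):
    """Return the label with the highest Softmax probability, plus all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


label, probs = predict([2.0, 0.5, -1.0], ("MS", "blasting", "mechanical"))
```

The three labels correspond to the three signal classes in the database, and the returned probabilities play the role of Prob in Figure 6.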

Training process
The training of a CNN includes two processes, namely forward propagation and backward propagation. The above description of the structure of the CNN follows the order of forward propagation. In forward propagation, the signal input X in the first layer forms the output O of the last layer after multi-layer convolution, pooling, and feature integration in the fully connected layers. The error C is calculated by comparing the output with the expected tag T, which is also known as the loss function, defined as follows: where, n is the number of samples, and the loss of all training samples is averaged; x and y(x) indicate the input of the neural network and the predicted value corresponding to the input x, respectively; a represents the given tag.
In the training process, when the recognition result is inconsistent with expectations (that is, the final value is inconsistent with the expected value), backpropagation (Girshick 2015) is conducted to calculate the error between the result and the expected value. The backward paths of the neural network are traversed, and the error is transferred back to each node layer by layer to calculate the errors in each layer. Thereafter, the weights and biases of the network are updated using the gradient descent method. Finally, the training ends when the loss function on the training samples is less than a specific threshold value. The final weights and biases are used for prediction and recognition.
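The forward pass, error calculation, weight update, and threshold-based stopping rule described above can be sketched on a deliberately tiny model: a single Softmax layer trained by batch gradient descent. The cross-entropy loss, the learning rate, the threshold, and all names are assumptions made for the example. A real CNN would backpropagate the same output error through its convolution and pooling layers as well.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def train(samples, targets, n_classes, lr=0.5, threshold=0.05, max_epochs=5000):
    """Batch gradient descent on a one-layer Softmax classifier."""
    n, d = len(samples), len(samples[0])
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    loss = float("inf")
    for _ in range(max_epochs):
        grad_W = [[0.0] * d for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        loss = 0.0
        for x, t in zip(samples, targets):
            # Forward propagation: logits -> class probabilities -> averaged loss.
            logits = [sum(W[k][i] * x[i] for i in range(d)) + b[k] for k in range(n_classes)]
            p = softmax(logits)
            loss -= math.log(p[t]) / n
            # Backward propagation: the output error p - one_hot(t) gives the gradients.
            for k in range(n_classes):
                err = (p[k] - (1.0 if k == t else 0.0)) / n
                grad_b[k] += err
                for i in range(d):
                    grad_W[k][i] += err * x[i]
        if loss < threshold:  # stop once the training loss is below the threshold
            break
        for k in range(n_classes):
            b[k] -= lr * grad_b[k]
            for i in range(d):
                W[k][i] -= lr * grad_W[k][i]
    return W, b, loss


samples = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W, b, final_loss = train(samples, [0, 1, 2], n_classes=3)
```

The `max_epochs` cap is a practical safeguard for the sketch, since a loss threshold alone gives no guarantee of termination on data that cannot be fitted.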

LSTM model
The LSTM model is composed of several typical LSTM units in series, and the structure of the LSTM unit is shown in Figure 7. The LSTM unit adds or eliminates information for neurons by introducing the concept of a 'gate' (Graves 2012), that is, by controlling the input and output of information to realize the memory function. The LSTM unit contains three gates (input gate, output gate, and forget gate), which are used to protect and control the state of the storage cell C.
The forget gate controls how much information in the storage cell $C_{t-1}$ at the previous moment is kept in the storage cell $C_t$ at the current moment. The input gate mainly controls how much information on the input $X_t$ at the current moment is stored in the storage cell $C_t$. The output gate is mainly used to control how much information in the storage cell $C_t$ at a certain moment is passed to the hidden state $h_t$ of the cell. In a forget gate, the forgotten part of a state memory cell is jointly determined by the input $X_t$, the state memory cell $C_{t-1}$, and the intermediate output $h_{t-1}$. In an input gate, $X_t$ transformed by the sigmoid and tanh functions jointly determines the vector retained in the state memory cell. The intermediate output $h_t$ is jointly determined by the updated $C_t$ and the output $O_t$. The above working principle is expressed by the following equations:

$$f_t = \sigma(W_{fx} X_t + W_{fh} h_{t-1} + b_f)$$
$$i_t = \sigma(W_{ix} X_t + W_{ih} h_{t-1} + b_i)$$
$$g_t = \phi(W_{gx} X_t + W_{gh} h_{t-1} + b_g)$$
$$O_t = \sigma(W_{ox} X_t + W_{oh} h_{t-1} + b_o)$$
$$C_t = g_t \odot i_t + C_{t-1} \odot f_t$$
$$h_t = \phi(C_t) \odot O_t$$

where $f_t$, $i_t$, $g_t$, $O_t$, $C_t$, and $h_t$ represent the forget gate, input gate, input node, output gate, memory cell, and intermediate output state, respectively; $W_{fx}$, $W_{fh}$, $W_{ix}$, $W_{ih}$, $W_{gx}$, $W_{gh}$, $W_{ox}$, and $W_{oh}$ indicate the weight matrices of the corresponding gates multiplied by the input $X_t$ and the intermediate output $h_{t-1}$, respectively; $b_f$, $b_i$, $b_g$, and $b_o$ denote the bias terms of the corresponding gates; $\odot$ denotes element-wise multiplication of vectors; $\sigma$ and $\phi$ refer to the sigmoid and tanh functions, respectively. In the LSTM model, several LSTM units receive the characteristic input of a monitoring signal at different moments. Through the above operation of the three gates, the historical information of the time series can be recorded. At each moment, the attributes at the current and previous moments are recorded. Such a feature confers an important advantage when recognizing MS monitoring signals. In learning and training, the LSTM adopts the same back-propagation method for errors as used in the CNN.
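One time step of the LSTM unit described above can be sketched as follows, assuming the standard formulation without peephole connections, which is consistent with the weight matrices listed in the text. The function names and the dictionary layout of the parameters are illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, x, W_h, h, b):
    """Compute W_x x + W_h h + b for one gate (one row of each matrix per hidden unit)."""
    return [
        sum(w * xi for w, xi in zip(W_x[k], x))
        + sum(w * hi for w, hi in zip(W_h[k], h))
        + b[k]
        for k in range(len(b))
    ]


def lstm_step(x_t, h_prev, c_prev, p):
    """Advance one LSTM unit by one time step; p holds the weights and biases."""
    f = [sigmoid(z) for z in affine(p["W_fx"], x_t, p["W_fh"], h_prev, p["b_f"])]    # forget gate
    i = [sigmoid(z) for z in affine(p["W_ix"], x_t, p["W_ih"], h_prev, p["b_i"])]    # input gate
    g = [math.tanh(z) for z in affine(p["W_gx"], x_t, p["W_gh"], h_prev, p["b_g"])]  # input node
    o = [sigmoid(z) for z in affine(p["W_ox"], x_t, p["W_oh"], h_prev, p["b_o"])]    # output gate
    # New cell state: gated new content plus the retained part of the old state.
    c = [gk * ik + ck * fk for gk, ik, ck, fk in zip(g, i, c_prev, f)]
    # Hidden state: squashed cell state filtered by the output gate.
    h = [math.tanh(ck) * ok for ck, ok in zip(c, o)]
    return h, c
```

Processing a sequence amounts to calling `lstm_step` once per time step and feeding the returned `h` and `c` into the next call, which is how the historical information is carried forward.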

SSA-CNN-LSTM hybrid model
CNN and LSTM are the main algorithms used in deep learning, but their advantages lie in dealing with different data types. CNN can extract high-level features with strong representational ability in spatial dimensions. LSTM, on the other hand, can extract temporal features and is good at processing monitoring-signal data with sequence characteristics. The monitoring signal is transformed into a time-frequency map and input to the network as an image. In the process, both the spatial feature information and the information pertaining to the time dimension should be considered. Based on the above features, the SSA-CNN-LSTM hybrid model was established by connecting the SSA method, CNN, and LSTM in series, which makes full use of the representational abilities of the CNN and LSTM for spatial and temporal features.
The structure of the SSA-CNN-LSTM hybrid model is shown in Figure 8. The principal components of monitoring signals are extracted by the SSA method. The essential parameters of the signals are obtained by STFT and converted into RGB images as input to the SSA-CNN-LSTM hybrid model. Then, they are passed through the convolution and pooling layers of the CNN and the hidden layer of the LSTM network to obtain an optimal feature representation. Finally, the monitoring signals are classified into the corresponding classes using Softmax, a multi-class non-linear activation function in the fully connected layer.

Implementation process of the SSA-CNN-LSTM
The detailed implementation steps of the SSA-CNN-LSTM hybrid model are summarized as follows, and the flowchart is shown in Figure 9:
(1) Data pre-processing. The flowchart is shown in Figure 9(a). First, all signals in the signal database are pre-processed by up-sampling and down-sampling so that the three types of signals (MS, blasting, and mechanical signals) have the same length. Afterwards, the corresponding label (MS signal, blasting signal, or mechanical signal) is assigned to each monitoring signal in the signal database. Then, 70% of the data are randomly selected from the monitoring signal database as the training samples, and the remaining 30% are used as the validation samples.
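The pre-processing in step (1) can be sketched as follows. Linear interpolation is assumed here for both up-sampling and down-sampling because the text does not specify the resampling method, and the function names and the fixed random seed are illustrative choices.

```python
import random


def resample_linear(signal, target_len):
    """Stretch or shrink a signal to target_len samples by linear interpolation."""
    n = len(signal)
    if n == 1 or target_len == 1:
        return [signal[0]] * target_len
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)  # fractional index into the original signal
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


def split_train_validation(samples, train_fraction=0.7, seed=0):
    """Randomly split labeled samples into training and validation sets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = round(train_fraction * len(samples))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]


# Each database entry is a (signal, label) pair; the signals are brought to a common length first.
database = [([0.0, 2.0], "MS"), ([0.0, 1.0, 2.0, 3.0, 4.0], "blasting")]
equalized = [(resample_linear(sig, 3), label) for sig, label in database]
```

For the 6000-signal database described earlier, a 70/30 split of this kind gives 4200 training samples and 1800 validation samples.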
(2) Extract RCs of the signals by the SSA method. The flowchart is shown in Figure 9(b). First, the method proposed by Cao (1997) is used to determine an appropriate embedding dimension. Subsequently, a trajectory matrix is constructed, the autocovariance matrix is solved, and the singular value decomposition is performed. Afterwards, RC 1 of all signals from the training samples is saved.
To determine the number K of RCs, 30 MS signals were randomly selected from the signal database for analysis. Figure 10 shows the SSA analysis process of a randomly selected signal among the 30 MS signals. Figure 10(b) shows the first five principal RCs: the amplitude of RC 1 is high, at about ten times that of RC 2. The amplitudes gradually reduce from RC 2 to RC 5, and the amplitude of RC 1 is about 1000 times that of RC 5 (RC 3 to RC 5 contain significant noise). Figure 10(c) shows the energy ratio of the RCs to the original MS signal. According to the principle that the energy of a valid signal component is higher than that of noise in MS signals, RC 1 can be regarded as the valid component of the MS signal, and the components reconstructed from the remaining singular values can be regarded as noise. A comparison of RC 1 with the original MS signal (Figure 10(d)) shows that the valid MS signal components are retained in RC 1. The energy ratios of the RCs to the original MS signal were calculated for the 30 MS signals, as shown in Figure 11. Figure 11 shows that the energy ratios of RC 1 of the 30 random MS signals are far larger than those of the other RCs; that is, the valid components of the signals are concentrated in RC 1. Therefore, in this step, K is taken as 1.
(3) Build the hybrid signal recognition network. The flowchart is shown in Figure 9(c). RC 1 of all signals of the training samples is subjected to the STFT method to obtain the time-frequency map, which contains the time, frequency, and amplitude characteristics of the signals. This time-frequency map is used as the input of the hybrid network.
For example, Figure 12 shows the STFT results of the MS, blasting, and mechanical signals (Figure 2) before and after extracting effective components through the SSA method: the time-frequency characteristics of the three signals are largely different. Specifically, the frequency of MS signals is lower and more concentrated, and the signals attenuate slowly. The time-frequency maps of mechanical and MS signals demonstrate that their frequencies are concentrated within the same range and have similar amplitudes; however, the maps also show that mechanical signals attenuate more slowly than MS signals. After the signals are processed by the SSA method, they can be better distinguished. With the STFT, the time, frequency, and amplitude characteristics of signals are captured together, without the multi-parameter extraction needed by conventional methods.
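The way an STFT turns a one-dimensional signal into a time-frequency map can be sketched with a direct (slow but dependency-free) implementation. The frame length, hop size, Hann window, and function names are assumptions for illustration. The paper does not state its STFT settings or how the map is colored, so the final colormap step that would yield an RGB image is left out and only a normalization to the range 0 to 1 is shown.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return frames[t][k]: magnitude at time frame t and frequency bin k."""
    # Periodic Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(n_bins)
        ])
    return frames


def to_unit_range(spec):
    """Scale a magnitude map to [0, 1] so it can be passed to a colormap."""
    peak = max(max(row) for row in spec)
    return [[v / peak if peak > 0 else 0.0 for v in row] for row in spec]


# A 750 Hz tone sampled at 6000 Hz (the sampling frequency of the sensors).
fs = 6000
tone = [math.sin(2 * math.pi * 750 * n / fs) for n in range(256)]
spec = to_unit_range(stft_magnitude(tone))
```

With a 64-sample frame the frequency resolution is 6000 / 64 = 93.75 Hz per bin, so the 750 Hz tone falls in bin 8 of every frame. A longer frame gives finer frequency resolution at the cost of coarser time resolution.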
Design the hybrid model structure
The network structure of the SSA-CNN-LSTM is complex, so a suitable configuration can only be found through trial and error. Several trials with different structures were conducted, and the final SSA-CNN-LSTM topological structure is shown in Figure 13 and Table 1.
The first layer of the SSA-CNN-LSTM is the input layer, whose input is the time-frequency RGB image generated in the above steps. A sequence folding layer is then used so that the convolution operations are performed independently on each time step of the image sequence. The third to seventh layers of the network are five convolution layers, coupled with maximum pooling. Fc 6 and Fc 7 in Table 1 are fully connected layers. An unfolding layer and a flatten layer follow them to restore the sequence structure and reshape the output into vector sequences, which are then input into the LSTM layer. In the final stage of the hybrid networks model, one fully connected layer and a Softmax layer are used for signal recognition and output. In addition, three dropout operations (with a rate of 0.5) are used in the hybrid networks to reduce the possibility of overfitting and improve the model's generalization ability.

There are many hyperparameters to be tested in such a deep neural network, including the learning factor α, the factor β of gradient descent with momentum, the parameters (β1, β2, and ε) of the Adam optimization algorithm, the descent parameters of the learning factor, and the number of samples in each training batch. In this study, the learning factor α was set to 0.001, and the three parameters β1, β2, and ε of the Adam algorithm were set to 0.9, 0.999, and 10^-6, respectively.
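The role of the Adam parameters listed above can be made concrete with a sketch of a single update for one scalar weight. This follows the standard Adam update rule with the values reported in this study as defaults. The function name `adam_step` and the toy objective are illustrative assumptions, and a real training framework applies the same rule to every weight of the network.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-6):
    """One Adam update for a scalar parameter; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (assumption): f(theta) = (theta - 3)^2, with gradient 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
for step in range(1, 101):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, step)
```

The step length is roughly bounded by the learning factor α regardless of the gradient magnitude, which is why a small value such as 0.001 gives stable but gradual training.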
(4) Test the recognition performance
The flowchart is shown in Figure 9(d). The validation samples are input into the SSA-CNN-LSTM hybrid model trained through the aforementioned steps, and the recognition result is compared with that obtained through manual signal recognition.
The threshold ε1 of the signal recognition accuracy is set (90% in this article). That is, if the recognition accuracy exceeds 90%, the SSA-CNN-LSTM hybrid model satisfies the requirement and is preserved; otherwise, the model is trained further with additional training samples until it satisfies the requirement.

Signal recognition performance using conventional recognition methods
This section tests the recognition performance of conventional recognition methods based on characteristic parameters of the three types of signals and compares them with the SSA-CNN-LSTM to highlight the necessity of this study. By repeatedly calculating parameters of the three types of signals in the signal database and referring to the results of previous studies (Dong et al. 2016b; Li et al. 2018), four parameters were selected as characteristic parameters for signal recognition: the number of signal peaks (which reflects the number of signals released in a short time), the dominant frequency (which reflects the essential characteristics of the signal), the maximum amplitude (which reflects the intensity of the signal), and the rise time (which reflects the energy rate of the signal in a short time and is related to rock failure). The rise time is defined as the time from the P-wave onset to the maximum amplitude of the MS signal. The modified STA/LTA method (https://ww2.mathworks.cn/matlabcentral/fileexchange/51996-suspension-bridge-picking-algorithm-sbpx?s_tid=srchtitle) is used for picking the P-wave onset.

Figures 3 and 14 show the distributions of the four characteristic parameters for the three types of signals. As demonstrated in Figure 3(a), MS signals, which have a low dominant frequency, are easily distinguished from blasting signals. However, MS and mechanical signals overlap in the low-frequency band, so it is not easy to distinguish them based on this parameter. Figure 3(b) shows that MS signals mainly have one peak, whereas mechanical and blasting signals have multi-peak characteristics. Therefore, the number of peaks can distinguish MS signals from blasting and mechanical signals, but it does not easily distinguish blasting signals from mechanical signals. As displayed in Figure 3(c), MS and mechanical signals can be easily distinguished from blasting signals by the maximum amplitude, whereas MS and mechanical signals are not easy to distinguish from each other based on this parameter. Figure 14 shows that the rise time of MS signals is longer than that of the other two signals, although the three kinds of signals still overlap considerably in this parameter. In conclusion, it is not easy to distinguish the three types of signals with a single parameter, so multiple parameters must be used for comprehensive recognition.
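The four characteristic parameters described above can be computed from a waveform roughly as in the following sketch. Several choices here are assumptions made for illustration: the P-wave onset is passed in as a sample index rather than picked with the modified STA/LTA method, and the number of peaks is counted as the number of separate bursts in a smoothed amplitude envelope, which may differ from the definition used in this study. The function name `waveform_features` and its parameters `smooth` and `ratio` are hypothetical.

```python
import numpy as np

def waveform_features(x, fs, onset_index, smooth=20, ratio=0.5):
    """Return peak count, dominant frequency, maximum amplitude, and rise time."""
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    peak_index = int(np.argmax(abs_x))
    # Rise time: from the (given) P-wave onset to the maximum amplitude.
    rise_time = (peak_index - onset_index) / fs
    # Dominant frequency: largest spectral amplitude, ignoring the DC bin.
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    dominant = float(freqs[1:][np.argmax(spectrum[1:])])
    # Peak count (assumed definition): bursts where the smoothed envelope
    # rises above a fraction of its maximum.
    envelope = np.convolve(abs_x, np.ones(smooth) / smooth, mode="same")
    above = envelope > ratio * envelope.max()
    n_peaks = int(above[0]) + int(np.count_nonzero(above[1:] & ~above[:-1]))
    return {
        "n_peaks": n_peaks,
        "dominant_frequency": dominant,
        "max_amplitude": float(abs_x[peak_index]),
        "rise_time": rise_time,
    }

# Synthetic single-burst signal (assumption): 50 Hz, onset at sample 200, fs = 1 kHz.
fs = 1000.0
t = np.arange(1000) / fs
tau = t[200:] - t[200]
wave = np.zeros(t.size)
wave[200:] = np.exp(-tau / 0.05) * np.sin(2 * np.pi * 50.0 * tau)
features = waveform_features(wave, fs, onset_index=200)
```

A feature vector of this form is what parameter-based classifiers such as the SVM would take as input, in contrast to the time-frequency map used by the hybrid model.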
According to the four characteristic parameters in Figures 3 and 14, the recognition of MS signals is a linearly inseparable problem. In this section, four commonly used recognition models, namely the SVM (Suykens and Vandewalle 1999), decision tree (DT) (Safavian and Landgrebe 1991), K-nearest neighbor (KNN) (Keller et al. 1985), and linear discriminant analysis (LDA) (Chengjun and Wechsler 2002), were used for a preliminary assessment of the recognition accuracy of such signals. To obtain reliable recognition results, the four recognition models were trained with the 5-fold cross-validation method. The total and per-class recognition performance of the four models is shown in Table 2. Figure 15 shows the confusion matrices of the four models, taking the mean performance of the five tests. Figure 15 and Table 2 show that, with the SVM method, the recognition rate of MS signals is the highest (92.34%), followed by that of blasting signals (89.16%), whereas the recognition rate of mechanical signals is only 73.6%. The DT, KNN, and LDA methods follow in descending order of recognition effect, and their recognition accuracies for MS, blasting, and mechanical signals are lower than those of the SVM method. Therefore, when MS signals are recognized based on characteristic parameters of waveforms, higher accuracy can be obtained by using the SVM method.
Meanwhile, Figure 15 also shows that the SVM method has certain limitations. It has a low recognition rate for mechanical signals, and 18.4% of mechanical signals are recognized as MS signals, which affects MS data processing. A new recognition model for MS signals therefore needs to be established to overcome these limitations.
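The per-class recognition rates and the share of mechanical signals mistaken for MS signals are both read from a confusion matrix. A minimal sketch of that calculation is given below. The labels are toy data invented for the example and are not results from this study, and the function names are illustrative.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes (raw counts)."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def recognition_rates(matrix):
    """Return per-class recognition rates and the total recognition accuracy."""
    per_class = [row[i] / sum(row) for i, row in enumerate(matrix)]
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = correct / sum(sum(row) for row in matrix)
    return per_class, total

classes = ["MS", "blasting", "mechanical"]
# Toy labels (assumption), ten signals in total.
true = ["MS"] * 4 + ["blasting"] * 3 + ["mechanical"] * 3
pred = ["MS", "MS", "MS", "mechanical",
        "blasting", "blasting", "MS",
        "mechanical", "mechanical", "MS"]

cm = confusion_matrix(true, pred, classes)
per_class, total = recognition_rates(cm)
# Share of mechanical signals that were labeled as MS signals.
mech_as_ms = cm[2][0] / sum(cm[2])
```

The off-diagonal entry in the mechanical row and MS column is the quantity that matters most here, because those errors contaminate the MS catalogue used for later analysis.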

Signal recognition performance using SSA-CNN-LSTM hybrid model
Similarly, to obtain reliable recognition results and validate the SSA-CNN-LSTM hybrid model, the 5-fold cross-validation method was also used to train the SSA-CNN-LSTM hybrid model. The total and per-class recognition performance is shown in Table 3. In addition, to compare the recognition performance of the SSA-CNN-LSTM with that of other network models, 5-fold cross-validation of the single CNN, the single LSTM, and BP was also carried out, and the results are listed in Table 3. For details of the BP network structure and parameter settings, please refer to the literature (Xu et al. 2021). The input to the single CNN is the time-frequency RGB image obtained by applying the STFT method to the original monitoring signal (Figure 12). The input to the single LSTM and BP networks is the time-frequency series obtained with the STFT method.

Table 3 shows that the standard deviations of the total and per-class recognition performance of the SSA-CNN-LSTM under fivefold cross-validation are smaller than those of the other networks. That is to say, the recognition performance of the SSA-CNN-LSTM proposed in this article is stable. The stability of the recognition performance of the LSTM is slightly higher than that of the CNN, and both are better than the BP. In addition, Table 3 shows that the signal recognition accuracy of the SSA-CNN-LSTM is significantly higher than those of the single CNN, the single LSTM, and BP. The total recognition accuracies of the single CNN and single LSTM networks reach 89.72% and 91.96%, respectively, and their recognition accuracies for MS signals exceed 90%. This shows that the CNN has an excellent ability to extract spatial features from time-frequency RGB images and that the LSTM has an excellent ability to extract temporal features from time-series data.

A comparison of the network models shows that the SSA-CNN-LSTM integrates the advantages of the CNN and LSTM. In addition, the recognition performance of the BP is the worst among the four networks, which indicates that the feature extraction ability of deep learning is better than that of a traditional neural network.
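The stability comparison above rests on the mean and standard deviation of the accuracy over the five folds. The following sketch shows one way to organize such an evaluation using only the Python standard library. The majority-class baseline and the toy labels are placeholders (assumptions) standing in for a real classifier and the signal database, and the sample standard deviation is used because the exact definition behind Table 3 is not stated.

```python
import random
import statistics
from collections import Counter

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Return the mean and sample standard deviation of the fold scores."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return statistics.mean(scores), statistics.stdev(scores)

# Toy labels and a placeholder "model" (assumptions): always predict the
# majority class of the training fold.
labels = ["MS"] * 12 + ["mechanical"] * 8

def majority_baseline(train, test):
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test) / len(test)

mean_acc, std_acc = cross_validate(len(labels), majority_baseline)
```

A smaller standard deviation across folds means the accuracy depends less on which samples happened to fall in the training set, which is the sense in which the hybrid model is described as stable.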
To compare the recognition performance of these four models in more detail, Figure 16 shows the confusion matrices of the four models, taking the mean performance of the five tests (see also Figure 17).
By analyzing Figures 16 and 17, it is concluded that the total recognition accuracy of the SVM method, which performed best in Section 4.2.1, is only 85.03%. The SSA-CNN-LSTM improves on this by 9.53 percentage points. In addition, the confusion matrix of the SSA-CNN-LSTM shows that the recognition accuracies for blasting and MS signals are excellent, both exceeding 94%. As described in Section 4.2.1, owing to the similarity of MS and mechanical signals, the conventional methods distinguish these two types of signals poorly: the recognition accuracies for mechanical signals with the SVM, DT, KNN, and LDA methods are only 73.6%, 72.0%, 74.0%, and 70.7%, respectively. In contrast, the recognition accuracies for mechanical signals with the CNN and LSTM rise to 89.2% and 86.5%, while that of the SSA-CNN-LSTM is 92.5%; that is, only 3.2% of mechanical signals are mistaken for MS signals. Thus, the recognition accuracy of the SSA-CNN-LSTM is improved significantly.
The SSA-CNN-LSTM hybrid model proposed in this research considers the spatial and temporal features simultaneously, which is excellent for signal recognition. It can recognize MS signals required for the research from the monitoring signals in a complex underground environment and meet practical application needs in Xiadian Gold Mine.

Discussion
A new hybrid signal recognition model, SSA-CNN-LSTM, was proposed for MS monitoring in underground mines. The SSA-CNN-LSTM integrates the local spatial feature extraction of the CNN, the temporal feature extraction of the LSTM, and the principal component extraction of the SSA method, and achieves good recognition performance for the three types of signals. Compared with conventional methods and a single deep network model, it has higher recognition performance and better stability.

The numbers of the three types of samples studied above are balanced. When the numbers of the three types of samples are imbalanced, the recognition accuracy needs to be tested. Therefore, the MS, blasting, and mechanical signals were combined in nine different proportions. The grouping results are shown in Tables A1 and A2 in Appendix A. To meet the calculation requirements, the number of some samples in Table A1 exceeds the number of measured signals. In this case, resampling is used to increase the number of samples. The SSA-CNN-LSTM was used to test the recognition performance for the sample distributions described in Tables A1 and A2. The test results are shown in Tables A3 and A4. They show that the sample distribution has a significant impact on the recognition performance of the SSA-CNN-LSTM. ID 8 and ID 9 in Tables A3 and A4 show that the recognition of mechanical signals is the most strongly affected by the number of samples. This indicates that the mechanical signal is strongly similar to the MS signal, so more training samples are needed to ensure the recognition performance. In contrast, ID 3 and ID 7 in Table A3 show that even if the number of MS signals is reduced to 1000, the recognition accuracy remains above 90%. Thus, to achieve the ideal recognition accuracy, different types of signals need different amounts of training data.
To achieve a higher total recognition performance, the classes with a lower proportion of samples can be supplemented by resampling. As shown in Figure A1, oversampling the imbalanced data raises the performance of the SSA-CNN-LSTM to that of the SSA-CNN-LSTM trained with balanced data.
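Supplementing the under-represented classes by resampling can be sketched as follows. This minimal example assumes simple random oversampling, in which samples of the smaller classes are duplicated at random until every class matches the largest one. The exact resampling scheme used in this study is not specified, and the function name `oversample` and the toy sample identifiers are illustrative.

```python
import random
from collections import Counter, defaultdict

def oversample(samples, labels, seed=0):
    """Duplicate random samples of minority classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Keep every original sample and add random duplicates up to the target.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels

# Toy imbalanced set (assumption): three MS signals and one mechanical signal.
samples = ["ms_1", "ms_2", "ms_3", "mech_1"]
labels = ["MS", "MS", "MS", "mechanical"]
balanced_samples, balanced_labels = oversample(samples, labels)
class_counts = Counter(balanced_labels)
```

Oversampling should be applied to the training folds only, so that duplicated signals do not appear in both the training and validation sets.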
As can be seen from Figure 16(a), in the case of sample balance, although the recognition accuracy of MS signals reaches 97.2%, 5.0% of the other signals are still mistaken for MS signals, so the recognition model needs to be further improved. At an underground mine site, many factors affect the recognition performance of MS signals. For example, MS signals are usually mixed with other types of signals, which changes the frequency and amplitude characteristics of MS signals. Therefore, a good pre-processing method is an essential factor in improving the recognition performance and is as important as the recognition model.
The pre-processing of the SSA-CNN-LSTM involves component extraction and the STFT transformation, which take a long time. At the same time, the introduction of the LSTM increases the training time of the model compared with the single CNN. In addition, owing to the complexity of the monitoring signals, it takes a long time to screen the monitoring signals manually: processing the 7164 signals used in this article took 15 days. Therefore, provided that the recognition performance is ensured, a model that requires fewer training samples is very important for processing mine MS monitoring signals. A more advanced CNN or LSTM network may further improve the recognition performance while reducing the training time and the number of training samples required.
The network structural parameters of the SSA-CNN-LSTM are determined by the trial-and-error method. There is no mature theory to give the number of required network layers and neurons quantitatively. In this study, the same network structural parameters for the SSA-CNN-LSTM were used for all tests, and different results may be obtained with other parameters. Furthermore, due to the complexity of the mine environment, the network structure in this study is not necessarily suitable for other mines. Therefore, the SSA-CNN-LSTM needs to be adjusted when it is applied in other mines.

1. A new hybrid networks model, SSA-CNN-LSTM, based on SSA, CNN, and LSTM is proposed. This hybrid networks model combines signal principal component extraction, local spatial feature extraction, and temporal feature extraction. It provides a new idea for the effective recognition of MS monitoring signals.
2. Based on the actual monitoring data recorded in the Xiadian Gold Mine, the monitoring signals are recognized by four commonly used methods, i.e., SVM, DT, KNN, and LDA. The recognition accuracy obtained by SVM is the best among those four methods, and its recognition accuracy for MS signals reaches 92.3%. However, the recognition accuracy of SVM for mechanical signals is only 73.6%, and 18.4% of mechanical signals are mistaken for MS signals, which is too inaccurate for in situ practical application.
3. The proposed SSA-CNN-LSTM hybrid model is tested and compared with the CNN, LSTM, and BP models and the four commonly used methods: the total recognition accuracy of the SSA-CNN-LSTM hybrid model increased to 94.56%. Furthermore, the individual recognition accuracies for MS, blasting, and mechanical signals reached 97.2%, 94.0%, and 92.5%, respectively. Only 3.2% of mechanical signals are mistaken for MS signals, indicating that the recognition performance is improved relative to the other methods.

Disclosure statement
No potential conflict of interest was reported by the authors.

Data availability statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Table A2. Nine distributions of 3000 signals for studying the impact of imbalanced training data on the SSA-CNN-LSTM model. ID 1 is not included since it was already balanced and thus was not oversampled.