A review: data driven-based fault diagnosis and RUL prediction of petroleum machinery and equipment

In this paper, an up-to-date overview is provided on the data driven-based fault diagnosis (FD) and remaining useful life (RUL) prediction problems of the petroleum machinery and equipment (PME). First, the FD and RUL prediction of five key components including bearings, gears, motors, pumps and pipelines are discussed by adopting mathematical statistics and shallow learning. Then, four kinds of widely-used DL models, i.e. deep neural networks, deep belief networks, convolution neural networks and recurrent neural networks, are surveyed, and the applications in the field of PME are highlighted. Finally, the possible challenges are proposed and some corresponding research directions in the future are presented.


Introduction
Serving as one of the most important roles in the petroleum industry, the petroleum machinery and equipment (PME) has received particular research attention such as the pipelines and the pumping units. Note that the construction of the pipelines not only increases the transport volume of the oil but also reduces the cost of the transportation. In addition, as the core equipment for oil exploitation, the pumping unit has been gaining an ever-increasing research attention, especially the smooth operation which is the basic requirement for oil production. Nevertheless, it is worth noting that once these equipments fail, the economic losses and casualties are unpredictable. For example, as for the pipelines, the longtime work will lead to the aging of the anti-corrosion layers, which probably brings corrosion perforation, leakage, rupture and other accidents. With regard to the pumping units, the crank pin fracture always results in frame torsion, gearbox overturning and other serious accidents. Moreover, the frequently occurred failures of the gear and bearing of pumping unit gearbox will severely affect the oil productions and the economic benefits. To solve these problems, the fault diagnosis (FD) and the remaining useful life (RUL) prediction issues of the PME have attracted tremendous research attention, which aim to ensure the safety production (Heng et al., 2009).
Till now, the researchers have paid much attention to the FD and RUL problems, see Layouni et al. (2017), CONTACT Hongli Dong shiningdhl@vip.126.com,shiningdhl@gmail.com S. , Shaik et al. (2020), Zadkarami et al. (2020) and the reference therein. In general, the FD and RUL are roughly divided into five steps as demonstrated in Figure 1: data acquisition, data preprocessing, feature extraction, feature selection as well as recognition and prediction. Taking the bearing of the pumping unit as an example, the five steps are explained as follows.
(1) Data acquisition refers to the vibration signal collections of the bearings from the installed acceleration sensors; (2) Data preprocessing aims to remove the outliers and noise by some methods (2021; L. Zou et al., in press-a), and then carries on the signal normalization; (3) The purpose of feature extraction is to extract some commonly used features including obvious features and implicit features from the collected data. To be more specific, the obvious features mainly include the time-domain features, frequency-domain features and time-frequencydomain features, and the implicit features are those learned autonomously by models such as deep learning (DL); (4) Generally speaking, according to different search strategies, the methods of feature selection can be classified into three categories on the basis of global optimal search strategy, random search strategy (e.g. relief algorithm Kira & Rendell, 1992), and heuristic search strategy (e.g. sequence forward selection algorithm Ververidis & Kotropoulos, 2005). The proposed selection methods are employed to select the features that are sensitive to the health states of the bearings from the extracted features so as to further improve the accuracy of the FD; and (5) Finally, by using the feature data sets, the purpose of recognition and prediction is to train the model for establishing the relationship between the selected features and the health states of the equipments. In view of this, for an unmarked vibration signal sample of the bearings, the model can identify the health states of the bearings and predict the RUL of them.
In order to extract and select the features of the running states of the PME, the MS method has been employed which, however, can only provide good diagnosis and prediction results for small-scale data. That is to say, it is difficult to process and analyse massive data automatically through the MS. Among the ML methods, the SL method is not expert in expressing complex functions when the samples and computing units are limited. In the recent years, the DL model has attracted much attention (Hinton et al., 2006), which is capable of extracting features from the source signals adaptively and avoiding the impacts from the artificial feature extraction. Moreover, owing to the emphasis on the depth of the model structure, DL has a good generalization ability and thus the low nonlinear performance of the shallow neural networks can be well solved. In addition, DL is capable of transforming the feature representations of the samples in the original space into the representations in a new feature space through layer-by-layer feature transformation, which makes it easy to classify the faults and predict the RUL. The unique advantages of DL provide a new idea for the research on FD and RUL prediction of the PME (J. Feng et al., 2018;Georg & Matthias, 2018). It is worth mentioning that the combination of other methods and DL has become a research focus, and some pioneering results have been reported. The main contributions of this paper can be highlighted as follows: (1) We have reviewed all searchable articles of DL for fault diagnosis and RUL prediction of PME.
(2) Four kinds of widely-used DL models, i.e. DNN, DBN, CNN and RNN, are surveyed, and the applications in the field of PME are highlighted. (3) To the best of our knowledge, this is the first comprehensive DL survey for FD and RUL prediction of PME. (4) We have reviewed the current status of the DL in FD and RUL prediction of PME and also highlighted the future opportunities.
The remaining sections of this paper are organized as follows. Several main methods based on MS and SL with their applications in FD and RUL prediction of the PME are introduced in Section 2. Then a few major methods based on DL with their applications in FD and RUL prediction of the PME are highlighted in Section 3. The possible challenges, research topics in the future and some corresponding research directions are summarized in Section 4.

Fault diagnosis and remaining useful life prediction based on mathematical statistics and shallow learning
In this section, a review of FD and RUL prediction methods based on MS and SL is given. First, some components of the oil production and transportation equipments are listed. Then, the FD and RUL prediction methods based on the MS and SL are introduced.

Important components of PME
In general, the PME is divided into the oil production equipments (e.g. pumping unit, hydraulic oil production devices and drilling rig) and the oil transportation equipments (e.g. pipeline and screw pump).
When it comes to the oil production equipments, the pumping unit will be the first one to be considered. As shown in Figure 3, the pumping unit is powered by the motor all the time (C. . The reducer of the pumping unit is an internal meshing transmission device composed of internal gears and external gears, which is responsible for reducing the speed of the motor. It should be noted that the crank and the connecting rod are connected by rolling bearings, then the beam and the steel support are connected by the bearings in the pumping unit. Furthermore, it can be observed from Figure 3 that the bearings exist in the reducer as well. For other oil production equipments such as hydraulic oil production devices and drilling rigs, the components like pumps are critical to their work. It is well recognized that pipelines have been the main oil transportation equipments in the past decades. The reasons for adopting pipelines to transport oil are twofold. One is that the transportation volume of them is large and the transportation cost is low. The other one is that the pipelines are buried underground deeply, so they occupy a relatively smaller area and are less affected by the climates and environments. Moreover, as another kind of important component, the screw pump is often chosen to connect with the pipelines due to that it does not form the turbulence and is not sensitive to the viscosity of the medium (Alves et al., 2011).
It is concluded from the previous contents that the components such as bearings, gears, pumps, motors and pipelines are playing extremely important roles in the work of the PME. Once the components work under high load and harsh environment without timely repair and maintenance, the components will be vulnerable to wear and corrosion over a long period of time. As demonstrated in Figure 4, two types of faults frequently occur in bearings, namely, the inner ring wear and the outer ring wear. Additionally, the gears often break down including cracking, wearing and fracturing. Consequently, it is of great significance to implement FD and RUL prediction for the components. Then some typical equipments and their main components with requirement on FD and RUL prediction are shown in Figure 5.

Short-time Fourier transform
It is claimed that Fourier transform is a commonly used mathematical transform method. As a significant extended signal processing method, the STFT has been proposed by Gabor in 1946, with the purpose of analysing non-stationary signals (Gabor, 1946). During the past years, with great efforts, some researchers have made some achievements in applying STFT to the FD of bearings (H. , gears (Bian et al., 2019) and pumps (W. Zhao et al., 2016). Particularly, the combination of other methods and STFT has attracted plenty of research attention. Y.  have adopted STFT and CNN to diagnose faults of bearings. In the proposed method, the STFT has been employed to obtain the input images of the bearing inner and outer race fault signals. After that, the extracted features have been fed into the CNN to diagnose faults sequentially. Similarly, in H. , the STFT has also been combined with the CNN, in which the STFT has been used to obtain the time-frequency spectrum samples of the vibration signals of rolling bearings. Jointly with CNN, the end-to-end fault pattern recognition has been realized. Moreover, the STFT has been applied to the transformation of signal dimensions as well. For instance, in Bian et al. (2019), the one-dimensional vibration signals have been transformed into a two-dimensional spectrogram by STFT, and then the obtained image has been input into the trained CNN model, which has achieved the FD of the gearboxes.

Wavelet transform
The WT has a time-frequency window that changes with frequency. The main idea of WT is to decompose a signal into a series of wavelets. It is worth mentioning that WT can fully highlight the characteristics of the considered problems and overcome the weaknesses that the window size of STFT does not change with frequency and the time resolution and the frequency resolution cannot achieve the best simultaneously (H. Sun et al., 2014). Inspired by J. , we have reviewed some practical applications during the past few years. In Liang et al. (2019), the WT has been combined with constructed multi-label CNN to diagnose the faults of the gears. To be specific, the two-dimensional time-frequency features have been extracted from the original one-dimensional vibration signals by WT and then input into the multilabel CNN model to further diagnose the faults with examples including fracture and cracking. Moreover, we have found many other variants of WT applied in FD, which include continuous WT (Ahuja et al., 2020;Parey & Singh, 2019), discrete WT (Bajric et al., 2016;Kordestani et al., 2018), wavelet packet transform (Islam & Kim, 2019;Wan & Zhang, 2018;L. Wang, Liu, et al., 2020) and empirical WT (H. . The continuous WT has been ever employed to decompose the angular domain averaged signals of the gearboxes (Parey & Singh, 2019). The obtained acoustic features such as crack and fracture have been compared with those of healthy gears, and then the optimal scale range has been determined according to the energy-Shannon's entropy ratio of the continuous wavelet coefficients. In R. , discrete WT, another variant of WT, has been used to identify the different health states of the gearboxes by combining with the CNN. In this method, discrete WT is capable of giving a more prominent and comprehensive representation of timefrequency distribution of the concerned signals, which is input into the trained CNN model for further FD. In addition, in order to analyse the inherent defect information consisting of different sizes of cracks on the outer, inner and roller raceway in bearing fault signals, the wavelet packet transform has been applied to quantify the input signals into plenty of low and high-frequency sub-bands signals, which have been fed into an adaptive deep CNN (Islam & Kim, 2019). In H. , the empirical WT has been used to decompose the three-channel signals of hydraulic pump into several AM-FM components, the weights of which have been measured through the defined variance contribution rate. Furthermore, the spatial clustering algorithm has been employed to improve the performance of the empirical WT in the whole space, which has solved the improper spectrum segmentation problem of the empirical WT.

Empirical mode decomposition
The EMD adaptively decomposes the signals through the time-scale features of the data themselves with no requirement on setting any basic functions, which is essentially different from the wavelet decomposition method based on wavelet basis functions. Therefore, the EMD can be applied to the decomposition of any types of signals theoretically. It is widely recognized that the EMD has been exploited in signal processing (Das et al., 2020;J. Zheng & Pan, 2020), data feature analysis (Aziz et al., 2019;Q. Li, Zheng, et al., 2020;Mondal et al., 2018) and FD (Abdelkader et al., 2016(Abdelkader et al., , 2018Alabied et al., 2018;Cheng et al., 2019;C. Guo et al., 2016;Park et al., 2018;Xu et al., 2020;Xue et al., 2015). In Xu et al. (2020), according to the frequency features of the signals processed by the combination of EMD and Hilbert transform, both the flexural wave and the longitudinal wave have been selected for the localization. Then the leakage sources in the pipelines have been located through the peak magnitude corresponding to the arrival time of the guided waves. During the past decades, breakthroughs on improving EMD have been achieved with examples including fast EMD (Y. Wang, Zhu, et al., 2020), median EMD (Karatoprak & Seker, 2019) and ensemble EMD (Abdelkader et al., 2018;Cheng et al., 2019;Park et al., 2018;L. Wang & Shao, 2020;Xue et al., 2015;Yeh et al., 2010). To be specific, the EMD algorithm is illustrated in Algorithm 1 (Lei et al., 2013) and the decomposition process is shown in Figure 6.
For improving the computation speed and inhibiting the phenomenon of modal over-decomposition of the EMD, the fast EMD has been proposed in Y. H. Wang et al. (2014). In Y. Wang, Zhu, et al. (2020), the fast EMD has been employed to automatically separate the original signals of the hydraulic pump in the FD of hydraulic pump, then the useful components of signals have been obtained by measuring the relative entropy. As another variant of EMD, the median EMD has added a process in which the decomposed IMFs have been filtered by a variable window sized median filter and summed again to recompose the signals. In Karatoprak and Seker (2019), Algorithm 1 The EMD algorithm.
1: Combine the local maximum and minimum points of the original signal f (t) to construct the upper and lower envelopes. 2: Construct the mean curve m(t) from the upper and lower envelopes in 1.
: Repeat 1, 2 and 3 constantly, until residual error r(t) meets the predetermined conditions.
the fault features of the motor have been extracted by the median EMD, thus the mode-mixing has been reduced and the fundamental frequency of each IMF has been separated. In Park et al. (2018), the ensemble EMD has been employed to extract fault features from transmission error signals of faulty gears under the noisy environment, then the k-nearest neighbour algorithm has been adopted to classify faults. Furthermore, it should be noted that ensemble EMD has evolved into some variants containing adaptive fast ensemble EMD (Xue et al., 2015) and complementary ensemble EMD (Cheng et al., 2019;Yeh et al., 2010) in the past few years. With respect to complementary ensemble EMD, it has been designed to reduce the computation time of the ensemble EMD, in which the pairs of complementary ensemble IMFs with positive and negative white noises have been introduced to eliminate the residue noises in the IMFs (Yeh et al., 2010). On the basis of the complementary ensemble EMD, the noises in pairs of positive and negative have been added to the residue signals for the sake of reducing the reconstruction errors and mitigating the effects of mode mixing. The above mentioned method has been adopted to analyse the non-stationary vibration signals of rolling bearings in the FD for bearings (Cheng et al., 2019).

Variational mode decomposition
The EMD is often accompanied by the phenomena of end effects and aliasing of modal components. With the purpose of solving these problems, the VMD has been proposed to deal with the non-stationary sequence signals, which is capable of determining the numbers of the mode decomposition of the given sequence according to the practice. Furthermore, in order to separate the IMFs effectively and divide the signals in frequency domain, the VMD adaptively matches the optimal central frequency and the finite bandwidth of each mode in the process of searching and seeking the solution. For clarity, the complete algorithm of the VMD is given in Algorithm 2 (Dragomiretskiy & Zosso, 2014). It is worth noting that the VMD has been adopted in the area of FD for a variety of equipments with examples including motors, pipelines and gears in the past decade (Y. Choi et al., 2018;Diao et al., 2020;Z. Guo et al., 2019;Ju et al., 2019;Y. Liu et al., 2016;Miao et al., 2019;Yan, Liu, Zhang, et al., 2020;D. Zou & Ge, 2019).

Algorithm 2
The complete algorithm of VMD.
In D. Zou and Ge (2019), the singular value decomposition has been adopted to calculate the effective numbers of the band-limited IMF of VMD. In order to observe the frequencies of the fault features of the motors, the filtered stator current is divided into finite band-limited IMFs through VMD decomposing, and then the Hilbert transform has been performed on each band-limited IMF. The proposed method is convenient to observe the fault characteristic frequencies from the processing results. Also processing signals with original VMD, in Y. , the VMD has been employed to process the check valve vibration signals in different states. Next, the feature vector set formed by the extracted different features of each layer of IMF components has been processed by vector quantization, which are then input into models so as to identify the fault types of the check valves.
It is worth noting that the intelligent optimization algorithms have developed rapidly in recent years, such as particle swarm optimization (PSO) algorithm (W. Liu, Wang, Yuan, et al., 2021;C. Wang, Han, Shen, et al., 2020). The optimization problems of the parameters of VMD through swarm intelligence algorithms have stirred a great deal of attention (Diao et al., 2020;Yan, Liu, Zhang, et al., 2020;, which mainly include particle swarm optimization (PSO) algorithm and grey wolf optimization (GWO) algorithm. In Diao et al. (2020), the PSO has been adopted to self-select the VMD parameter pair for removing the background noise from pipeline leakage signal more accurately. In Yan, Liu, Zhang, et al. (2020), the GWO algorithm has been employed to determine the inside parameters of the VMD for the FD of rotors.
In addition, it is worth pointing that adaptive mechanism has shown prominent place in terms of parameter optimization (Elsayed et al., 2017;Miao et al., 2019;Zamuda, 2017;. A parameter adaptive VMD method based on grasshopper optimization algorithm (GOA) has been shown in X. Zhang et al. (2018) for the sake of analysing the vibration signals of rotating machinery. Taking the weighted kurtosis index constructed by the kurtosis index and correlation coefficient as the optimization objective, the GOA algorithm has been adopted to optimize the VMD parameters for the extractions of the fault features. Similarly, towards the parameter adaptive VMD for FD, Miao et al. (2019) has shown another index to be the optimization objective, namely, ensemble kurtosis. The improved method has performed great improvements on the FD of bearings.

Kalman filtering
In recent years, the filtering has developed greatly, such as the set-membership filtering , the recursive filtering , the distributed filtering (Q. Li, Shen, et al., 2020;Ma et al., 2019) and so on B. Shen et al., 2019;Tian et al., 2019;L. Zou et al., in press-b). As for the field of FD and RUL prediction, two main filtering methods, KF and PF, have been given a lot of attention. In 1961, KF, another method has been proposed by R. E. Kalman, which is an algorithm for optimal estimation of system states by using system state evolutions and observation data (Kalman & Bucy, 1961). It should be noted that the KF is capable of improving the weakness that the signals and noises must be stationary in Wiener filtering (Kalman, 1960). During the past few years, the KF has been improved a lot , and the KF has been widely used in the fields of target tracking (Jondhale & Deshpande, 2019a, 2019b, state estimation (Z. Zanni et al., 2017;S. Zhao et al., 2017), signal processing (Dai et al., 2017), FD and RUL prediction (Asl et al., 2019;Cui et al., 2019Cui et al., , 2020 Cui et al. (2020) have constructed two KF models based on linear function and quadratic function, which have been applied to analysing slow degradation stages and accelerated degradation stage of the evolution of the monitoring data of the bearings, respectively. In addition, the relative error of the sliding window has been constructed to adaptively judge the bearing degradation stages. Also aiming at the bearing degradation, Lim and Mba (2015) has introduced the switching KF. In Asl et al. (2019), the switching KF based on statespace method has been adopted to infer different bearing degradation processes and applied to the life prediction of bearings successfully. An adaptive squareroot unscented KF has been proposed for leakage FD in a hydraulic system, which can adaptively estimate the means and covariances of the process noises, the measurement noises as well as the system states. Motivated by Lim and Mba (2015) and Asl et al. (2019), a new switching unscented KF has been utilized to predict the RUL of the bearings. The measurement error parameters of the filter have been selected as the standard deviation of condition monitoring data in the degradation state, which can effectively filter out the fluctuation characteristics of the data (Cui et al., 2019). Owing to the ability to deal with the nonlinear system, the extended KF has been adopted to predict the RUL of bearings under different operating conditions, in which the best approximating analytical function has been employed to learn the parameters of the extended KF. Furthermore, the extended KF has been widely applied in the RUL prediction of pipelines, see Moulahi and Hmida (2020). As another variant of KF, the vold KF can decompose the multi-component vibration signals into constituent mono-component harmonic waves, which have been utilized to construct the timefrequency representation. In the case of non-stationary speed, the localized and distributed faults of the gears have been successfully diagnosed (Z. Feng et al., 2019).

Particle filtering
Note that in some applications with high-accuracy requirements, the KF algorithm is far from satisfactory. Therefore, the PF has been proposed whose idea is to find a group of random samples propagating in the state space to approximate the probability density function. Then the integral operation is replaced by the mean of the samples based on the Monte Carlo method for obtaining the minimum variance distribution of the states (Arulampalam et al., 2002). It is claimed that the PF algorithm has been broadly used in the fields of FD and RUL prediction (C. Chen et al., 2012;Lei, Li, Gontarz, et al., 2016;K. Peng et al., 2019;Qian & Yan, 2015;Zio & Peloni, 2011). As early as 2011, the PF has already been employed to predict the RUL of mechanical components. To be specific, the Monte Carlo simulations of the state dynamic model and measurement model have been adopted to estimate the posteriori probability density function of the future state of the degraded components (Zio & Peloni, 2011). Subsequently, the high-order PF has been normally combined with some intelligent algorithms (C. Chen et al., 2012Chen et al., , 2011J. Yu, 2015).
Recently, the trained adaptive neuro-fuzzy inference system has been integrated in a high-order PF to describe the fault progression (C. Chen et al., 2012). In the above mentioned method, the high-order PF has been employed to estimate the current state and predict pstep-ahead state through a group of particles with the purpose of estimating the probability density function of the RUL. More recently, a massive number of methods have been proposed to improve the PF (Lei, Li, Gontarz, et al., 2016;K. Peng et al., 2019;Qian & Yan, 2015). For instance, particles in each recursive step have been used to determine an alterable importance density function. Moreover, the back propagation neural network has been adopted to improve the particle diversity before resampling. Compared with PF, the improved PF has a significant improvement with respect to predicting the RUL of the bearings (Qian & Yan, 2015). In K. Peng et al. (2019), a fuzzy inference system has been introduced to improve the diversity of the particles. The resampling smoothing method based on fuzzy inference system has been implemented in each recursion, then the weights of particles have been redistributed, which has reduced the relative distance between the weights and realized the outlier smoothing. The proposed method has achieved a higher accuracy than traditional PF in predicting the RUL of the PME.

Artificial neural networks
The ANN is a mathematical model that simulates the structure and the function of the biological neural networks, which consists of a large number of processing units connected with each other. Recently, the scholars have paid much attention to the ANN (J. Li, Dong, Wang, & Bu, 2020;J. Li, Dong, Wang, & Fei, 2020;. The ANN has the capability of self-learning and solving some problems such as classification and prediction (Bohorquez et al., 2020;Gunerkar et al., 2019;Laredo et al., 2019;Shaik et al., 2020;Sharma et al., 2019;Tavakoli et al., 2020). The specific structure of ANN is shown in Figure 8. The neurons are connected by certain weights, and the input is converted into the output through the activation function. Till now, the ANN has been successfully applied in faults classification for bearings (Gunerkar et al., 2019;Sharma et al., 2019) and pipelines (Bohorquez et al., 2020;Shaik et al., 2020;Tavakoli et al., 2020).
It should be pointed out that ANN is frequently combined with other algorithms. In Gunerkar et al. (2019), aiming at classifying multiple faults of bearings, the fault features have been firstly extracted from the time-domain signals by WT. Then the features have been input into the ANN to classify the faults, and the obtained experiment results have verified that the ANN has a better classification accuracy than K-nearest neighbour algorithm in multiple faults classification problem. Towards the pipelines, the ANN based on the fluid transient waves has been adopted for leak detection of pipelines (Bohorquez et al., 2020). Whereafter, in Tavakoli et al. (2020), the wall thickness loss of the pipelines has been chosen as the health indicator. Then the ANN has been used to train the model for RUL prediction by combining the adaptive neural fuzzy inference system, in which a deterioration evaluation index has established for training. In  Laredo et al. (2019), the combination of differential evolution algorithm and multi-layer perceptron has been conducted to predict the RUL of the mechanical system, in which the deterioration evaluation algorithm has been adopted to optimize the data-related parameters.

Support vector machine
In the implementation of the ANN, the empirical components often play an important role, while the SVM has strict theory and mathematical basis. The SVM has been first proposed by Cortes and Vapnik (1995), which is a small sample data training and classification algorithm based on supervised learning. As is shown in Figure 9, the original space is generally a low-dimensional space such that it is difficult to achieve linear classification of samples. Consequently, some feature extraction methods have been proposed for extracting the typical features to obtain a feature space, and then an optimal hyperplane formed by the SVM has been achieved for linear classifications.
With the rapid developments of the intelligent optimization algorithms, some optimization algorithms have been applied in parameter optimizations of the SVM C. Wang, Han, Zhang, et al., 2020;D. Yang et al., 2015;Yao et al., 2018). In Han et al. (2019), the fault features of the gears under different load excitations have been extracted from different domains, respectively, thereby inputting into the PSO-SVM model sequentially to classify the faults of the gears. In the proposed method, the peak index of the SVM has been optimized by the PSO algorithm. As an extension of the PSO, the leader-follower PSO algorithm has been adopted to select the super-parameter C of the SVM in C. Wang, Han, Zhang, et al. (2020), which has constructed a new model to detect pipeline leakage. The discriminative and robust features of pipeline leakage data extracted by sparse autoencoder networks have been input into proposed model for further classification.
It has been observed from plenty of academic and industrial research that the research status of the SVM is similar to that of the ELM. In the field of FD, the research concerning the SVM has been roughly divided into two aspects, which includes the parameter optimization of the SVM and the signal preprocessing (Gangsar & Tiwari, 2017;Gong et al., 2019;Keskes & Braham, 2015;R. Xiao et al., 2019;X. L. Zhang et al., 2015;J. D. Zheng et al., 2017). A multi-variable integrated incremental SVM has been shown in X. L. Zhang et al. (2015) for FD of rolling bearings. Specifically, the original signals have been extracted to form a number of multi-variable feature subsets, which have been input into the SVM model for the sake of obtaining the multi-faults classification results through ensemble and incremental learning. In R. , the WT and Release-F algorithm have been employed to select the most discriminative pipeline leakage features, which have been then input into the SVM classifier to identify the leakage severity of the pipelines. Furthermore, as one of the representatives of DL, CNN has also been combined with SVM to diagnose the incipient micro-faults of the Gong et al. (2019), in which the original data has been fed into the improved CNN-softmax model to extract features, and the sparse representatives of the features have been fed into SVM for fault classifications. The proposed method has higher diagnosis accuracy than the methods containing individual SVM, K-nearest neighbour and deep back propagation neural network.

Extreme learning machine
In essence, both the SVM and the ELM are used for classification in high-dimensional space through certain mapping. It is worth pointing out that the SVM is good at binary classification problem and the ELM is good at multi-classification problem. Compared with the SVM, the ELM has faster training speed (G. Huang et al., 2006). As is indicated in Figure 10, the ELM is a single hidden layer feed-forward neural network and only the number of the nodes in the hidden layer needs to be set, which exhibits fast learning speed and good generalization performance (G. Huang et al., 2015). It is worth noting that the parameter optimization methods and the distinctive feature representation methods have been combined with ELM frequently during the past decade . In Z. , the FD of the bearings has been considered, in which a novel krill herd algorithm has been adopted to optimize the parameters of the ELM. Then the extracted feature vectors of the faults of the bearings have been imported into the kernel ELM for FD. In Jiang et al. (2019), the Mel-frequency cepstrum coefficient, as a signal feature representation method, has been combined with ELM for FD of pumps. The features of the denoised sound signals have been fed into the trained ELM model, which leads to the optimal number of the neurons in the hidden layer. The proposed method has higher recognition accuracy and shorter training and diagnosing time .
Recently, the combinations of ELM and signal preprocessing methods have stirred a great deal of attention (Z. Chen et al., 2019; X. Rinanto et al., 2016;H. Shao et al., 2018;Wei et al., 2016;. It should be noted that the ELM is often used as a classifier. In H. Shao et al. (2018), together with deep wavelet autoencoder, the ELM has achieved higher accuracy of bearing fault classification. Multiple wavelet autoencoders have been employed to construct deep wavelet autoencoder for the sake of enhancing the ability of unsupervised feature learning, which is capable of capturing the features of the signals effectively. Then the captured features have been input into the ELM model to identify different faults of the bearings. In Wei et al. (2016), the ELM has been applied to identifying and classifying faults of the gears by combining with the local mean decomposition method. Furthermore, in X. , in order to identify the types of the faulty gears, the fault samples have been firstly obtained through establishing the finite element model of the gears. Then the measured vibration signals of the gear transmission system have been adopted to be the test samples, which are input into the trained ELM to realize fault identifications.

Methods based on deep learning
During the past decades, the FD and RUL prediction issues based on the SL have developed rapidly. Whereas in most cases, the SL needs prior knowledge and preprocessing technique before diagnosing faults and predicting the RUL, and the representation ability of the SL is limited as well. With the arrival of the big data era, it is difficult for SL to automatically process and analyse massive data. As such, as another significant branch of ML, DL has attracted many scholars' attention. On account of the strong ability of feature extraction, the DL has been successfully applied to the training of massive data.

Deep neural networks
DNN is a kind of artificial neural networks where multiple hidden layers exist between the input layer the and output layer. It is claimed that the hidden layers are trained through massive data to improve the accuracy of clustering or prediction (Hong et al., 2019). DNN is a multi-layer neural network formed by the stack of multi-layer feature representation models which generally include autoencoders, stacked autoencoders (G. Tan et al., 2015;, denoising autoencoders (Lu et al., 2017;Meng et al., 2018), sparse autoencoders (Z. Chen & Li, 2017;W. Sun et al., 2016;Wen et al., 2019) and so on. It can be seen from Figure 11 that the autoencoder mainly consists of an encoder and a decoder. The encoder is capable of mapping the input data from a high-dimensional space to a low-dimensional space and obtaining the feature representation of the input data. On the contrary, the decoder is capable of mapping the input data from a low-dimensional space to a high-dimensional space and reproducing the input.
As is shown in Figure 12, stacking multiple autoencoders can form the stacked autoencoders, i.e. a DNN. In J. , the stacked autoencoders have been utilized to extract features, which have been then adopted to train the softmax regression classifier to identify the faults of the gears. The stacked autoencoders have been combined with the wavelet soft threshold method and back propagation network for FD of the bearings as  well (Tan et al., 2015). To be specific, the stacked autoencoders have been employed to extract the features of the fault vibration signals denoised by the wavelet soft threshold method, which have been regarded as the input of the back propagation network to train the model. Furthermore, as a technique to solve the over-fitting problem, dropout has been introduced into the stacked autoencoders (G. . Together with the ReLU activation function, the method has been used to capture the salient features of the faulty gears. In the presence of noises, the robustness of the autoencoders is far from satisfactory, and thus the denoising autoencoders have been proposed to reduce the sensitivity to the noises. In Meng et al. (2018), with the addition of the regularization terms, the noise level has been decreased at the back layers so as to adapt the different robustness of the different layers. The improved method has shown a high accuracy in FD of rolling bearings. Moreover, in Lu et al. (2017), the denoising autoencoder has been combined with the stack for FD of the rotating machinery, which has shown good robustness in iterative learning.
On the basis of the traditional autoencoder, some sparse constraints have been introduced into the neurons in the hidden layer of the autoencoder to obtain the sparse autoencoder. In Z. Chen and Li (2017), the time-and frequency-domain features have been input into multiple two-layer sparse autoencoders for feature fusion. After that, the fused feature vectors, (i.e. the machine health indicators), have been fed into the DBN for further fault classification of rotating machine. Subsequently, breakthrough on three-layer sparse autoencoders has been achieved for FD of bearings in Wen et al. (2019), which has been adopted to extract the features of the original data. Meanwhile, the discrepancy penalty between the features of the training data and the test data has been minimized through applying the maximum mean discrepancy term, which leads to a higher diagnosis accuracy than ANN, SVM and so on. In W. Sun et al. (2016), to identify the faults of the induction motor, the partial corruption is added into the input of the sparse autoencoder by means of denoising coding so as to improve the robustness of the feature representation. The features learned from the sparse autoencoder are then used to train a neural network classifier for identifying the faults.

Deep belief networks
When it comes to DBN, the restricted Boltzmann machine (RBM) can not be ignored, which contains two layers, i.e. visible layer and hidden layer. It should be pointed out that the neurons in one layer are connected to all the neurons in the other layer, whereas there are no connections between the neurons in the same layer. The structure of RBM is shown in Figure 13. In order to solve the linear inseparable problem, Hinton et al. have proposed the DBN in 2006(Hinton et al., 2006, which are composed of multiple RBMs as shown in Figure 14. The training process of DBN has been divided into two stages, the first one of which is pre-training. Specifically, the RBM has been trained by greedy training algorithm from the bottom to the top. Then the output of the current hidden layer has been regarded as the input of the next RBM, which has been repeated constantly. The parameters of the model  have been transferred layer by layer and optimized continuously to achieve local optimum. The second stage is reverse fine-tuning. In the last layer of DBN, the back propagation algorithm has been employed to propagate the training error from the high layer to the low layer and finetune the whole DBN, which therefore achieves the global optimum. It is widely recognized that the DBN has been successfully applied in the fields of leak detection, FD and so on (Y. S. Wang et al., 2018;Zeng et al., 2021). In recent years, the DBN has been utilized by combining other methods (H. J. Huang et al., 2019;Lang et al., 2018;Zhu et al., 2020). For example, in Zhu et al. (2020), the principal component analysis has been implemented to reduce the dimension of the original vibration bearing signals and then the fault features have been extracted according to primary eigenvalues and eigenvectors, which have been input into the DBN model. As another well-known signal preprocessing method, the wavelet packet analysis has been adopted to remove the noise in Lang et al. (2018), and then the denoised signals have been fed into the DBN with independent component regression to recognize different leak apertures. It should be noted that the classification accuracy has been improved when using independent component regression instead of the traditional gradient fine-tuning in the DBN.
It is crucial to optimize the parameters in the DBN. Some recent advances in the field of parameter optimization have stirred a great deal of attention (Gai et al., 2020;He et al., 2017;K. Yu et al., 2020). In K. , the combination of hybrid genetic algorithm (GA) and PSO has been employed to optimize the parameter of DBN for the classifications of the states and severities of the bearing faults. Next, combining with the dempster-shafer theory, the classification results of DBN have been further fused and hence the FD accuracy has been improved. It is worth noting that the GA has also been adopted to optimize the parameters of the DBN in He et al. (2017). With less requirement on prior knowledge, the GA-DBN model has adaptively utilized the robust features related to the faults and achieved the unsupervised FD of the gear transmission chain. Moreover, in order to diagnose the faults of gears more accurately, the learning rate and batch learning number with a great influence on the training error of DBN have been optimized in Gai et al. (2020). The above mentioned methods have solved the weakness of the generalization ability of the single classifier under the big data environment.
Recently, much attention has been devoted to improving the internal structure of the DBN so as to improve the performance (Che et al., 2020;Qin et al., 2019;. For instance, a method based on domain adaptive DBN has been proposed for FD of rolling bearings in Che et al. (2020). To be specific, the domain adaptive method has been used to calculate the discrepancy of the multi-kernel maximum mean in the last two hidden layers, which has been introduced into the loss function to fine-tune the model parameters so as to improve the generalization ability of the model. In addition, the activation function of the DBN in FD of the gearboxes has been improved in Qin et al. (2019) and the traditional Sigmoid function has been replaced by the improved logistic Sigmoid (Isigmoid) function, which has overcome the gradient vanishing problem in the process of back propagation. Furthermore, a multi-scale cascading DBN has been constructed for the automatic fault identification of the rotating machinery in , which can not only directly learn the highlevel feature representation from the multi-scale features, but also automatically identify the faults through cascading and softmax classifier.

Convolution neural networks
As a kind of neural networks, CNN has become a research hotspot in the field of ML, the basic structure of which has been composed of input layer, convolution layer, pooling layer, fully-connected layer and output layer as shown in Figure 15.
It should be pointed out that the one-dimensional (1-D) CNN is capable of processing time series data. With this property, the 1-D CNN has been widely applied in FD and RUL prediction during the past decade (Ince  Wu et al., 2019). In C. Wu et al. (2019), the relationship between the convolution kernel length and the accuracy has been obtained, and the number of layers has been determined by checking the distribution of the weight vector. Similarly, the constructed 1-D CNN has been employed for FD of the gearboxes.
The two-dimensional (2-D) CNN has been viewed as a significant technique to solve the image problems. In the past decade, it is found that 2-D CNN has more applications in the fields of FD and RUL prediction than the 1-D one (Grezmaka et al., 2019;S. Guo et al., 2020;T. Li et al., 2020;S. Shao et al., 2020;Wen et al., 2018;Xie et al., 2020;B. Yang et al., 2019;Zhu et al., 2019). For example, in Grezmaka et al. (2019), the layer-wise relevance propagation that is adopted to explain the correlation between the time-frequency spectrum and the fault types has been introduced into the 2-D CNN to diagnose the faults of the gearboxes, in which the WT has been firstly adopted to convert the vibration signals into time-frequency spectra images, and then input into the CNN to achieve the fault classification. In T. , a multi-scale multi-sensor feature fusion CNN has been proposed to extract the signal features with different scales and the information fusion has been performed on the data and feature level. The proposed method has a higher accuracy and faster convergence speed than the standard CNN in the gear fault detection.
In F. , a cascade structure has been established in the CNN for FD of motors, which has avoided the information loss caused by the consecutive convolution striding or pooling. In addition, a divide-andconquer parameters optimization algorithm has been used to make the cascade CNN converge to a better state, which has improved the diagnosis performance. As well as RUL prediction, two CNNs have been combined to predict the RUL of bearings. The incipient failure points have been identified by one CNN and then the RUL has been predicted by the other one. It is worth mentioning that an intermediate reliability variable has been mapped to RUL by a proposed mapping algorithm, which has achieved higher prediction accuracy and better robustness than other methods (B. Yang et al., 2019). Furthermore, the combination of multi-scale CNN and time-frequency representation has been proposed to predict the RUL of bearings in Zhu et al. (2019). In this method, the WT has been adopted to obtain the time-frequency representation containing a lot of useful information, which has been then given to the multi-scale CNN that is capable of keeping the global and local information synchronously and learning the significant features automatically.

Recurrent neural networks
With the developments of artificial intelligence, the research on RNN has sprung up gradually. The Jordan network (Jordan, 1986) in 1986 and the Elman network (Elman, 1990) in 1990 have been regarded as the earliest RNN. The RNN takes the sequence data as the input and performs the recursion in the direction of the sequence. Additionally, all the recurrent units are connected by chains. As shown in Figure 16, the value of the hidden layer not only depends on the current input, but also relies on the previous hidden layer. Consequently, it is suitable for RNN to deal with the time series data.
The applications of RNN and its variants in the fields of FD and RUL prediction have been increasing in recent years (L. Guo et al., 2017;F. Li, Chen, et al., 2019;Y. Peng et al., 2017;J. Yu et al., 2019). In F. Li, Chen, et al. (2019), a reinforcement learning unit matching RNN has been proposed to predict the degradation tendency of the rolling bearing. Specifically, the moving average singular spectral entropy has been selected as the state degradation feature of the rolling bearing, which has been input into the reinforcement learning unit matching RNN to complete the trend prediction of the rolling bearing states. Meanwhile, there are also some indicators established in L. Guo et al. (2017) such as the bearing RUL prediction indicator based on RNN. In Y. Peng et al. (2017), a RNN-based fault prediction method has been proposed for the gearbox, where the health index extracted from the generator current signal has been employed to quantify the health state of the gearbox. In the proposed method, the RNN has been established and trained by using the real time recurrent learning algorithm online to predict the health state.
The above discussions have verified that the RNN has successfully applied in many fields. Nevertheless, the RNN is prone to the problems of gradient explosion and gradient disappearance in the process of training and thus the gradient cannot be transmitted in longer sequence. In order to solve the gradient disappearance problem of RNN, some variants of RNN such as long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) have stirred much research attention in recent years. The LSTM has the memory unit and gate unit as shown in Figure 17. It should be noted that there are three types of gates in the LSTM, in which the forgetting gate controls the number of the information that can be transmitted from the previous unit to the current unit, and the input gate controls the number of the input information that can be saved in the current unit. Additionally, the output gate controls the number of the information that the current unit can output.
During the past decade, many research groups throughout the world from not only academia but also industry have been motivated to apply the LSTM to the field of FD D. Xiao et al., 2019;Ye et al., 2020;Yin et al., 2020). For example, an intelligent FD framework of the bearing based on LSTM has been proposed in An et al. (2020), in which the sample has been firstly segmented, and then each segment dimension of the sample has been extended by the input network with the hope of ensuring that there is adequate information memory space. Then the classification information has been stored and transferred in the LSTM network such that the health states can be classified by the output network. It should be noted that the LSTM has also been combined with other methods frequently. In D. , the recurrence quantification analysis features extracted from the recurrence plots and the empirical statistical features extracted from the time-and frequency-domain have been combined with the features extracted by the LSTM for constructing a hybrid feature set. Then the combined features have been input into a weighted batch normalization block. Finally, the faults have been diagnosed through a three-layer fully connected neural network. Additionally, the optimization of LSTM has attracted much research attention. For instance, in Yin et al. (2020), an optimized LSTM network with cosine loss (Cos-LSTM) has been established to diagnose the faults of gearboxes. In practice, the cosine loss is critical to eliminating the influence of signal strength and improving the diagnosis accuracy. Especially, the cosine loss is capable of converting the loss from the Euclid space to the angular space. The energy sequence features and wavelet energy entropy have been introduced into the Cos-LSTM networks such that the diagnosis accuracy of the Cos-LSTM is higher than those of other classical methods.
Compared with the DNN, the DBN is more competitive in dealing with the unlabeled classification problem. As for the high-dimensional data, the CNN combining the parameter sharing mechanism is commonly employed to reduce the computational complexity such that the classification problem of the high-dimensional data, e.g. the image, can be solved effectively. In addition, for timedependent problem, it is difficult for the DNN to capture the temporal relationships. Therefore, much effort has been devoted to the RNN owing to its advantages in dealing with the time-dependent problem, where the value of the current hidden layer not only relies on the value of the current input, but also depends on the value of the previous hidden layers.

Conclusion
In this paper, we have reviewed the latest research progress based on the FD and RUL prediction. The structures and applications in FD and RUL prediction of PME based on MS and ML are investigated, in which the applications based on DL are summarized more specifically. The discussed methods in this review and the corresponding references are summarized in the following Table 1. Inspired by the above literature review, some critical and unresolved main issues in the future are listed as follows.
(1) How to deal with the poor-quality data of the PME.
(2) How to address the feature correlation and the data limitation in data space. The feature dimension of the data space is always high, and thus it is difficult to distinguish the strong correlation between the features such as the feature correlation of the FD. As another disadvantage that cannot be ignored is that the data category is unbalanced because the equipments always work under normal conditions for getting normal data, therefore, the number of the fault sample data is limited. At present, to the best of our knowledge, the methods to deal with these problems are still at the early stage. (3) How to solve the mismatch problem of the data collected by multiple sensors. Different sensors exhibit different sensitivities to the PME, which may lead to the mismatch problem of the collected data, thereby resulting in different data transfer rates. That is to say, if the multi-rate data collected by multiple sensors does not match each other, the difficulties of FD and RUL prediction will increase dramatically. Therefore, the research on multi-sensor information fusion is also a problem to be considered. (4) How to suppress the noise and decompose the components of the main features of the signals in FD.
In reality, the signals to be decomposed are usually consisted of the noise and the disturbance which are not related to the desired features of the signals, and thus the important fault features cannot be found in time. In addition, the expected signals are usually relatively weak and easy to be drowned by noise/disturbance (M. . Consequently, more research should be conducted on suppressing the noise/disturbance so as to improve the resolution and the robustness of the decomposition method. (5) How to distinguish the multiple faults. In general, there may be multiple faults in a single component, and thus how to distinguish them and take measures to settle them should be considered adequately.
For the above problems, the corresponding suggestions are provided for readers to study.
(1) Establish a standard big database of the PME. First of all, a unified measurement standard of the data quality should be constructed, and then the test data of the experimental platform and the long-term monitoring data of PME should be shared with each other such that the data deficiencies are made up. In addition, some data cleaning methods such as the spatial clustering algorithm can be adopted to clean the abnormal data. Some methods conducive to data transmission can also be introduced, such as coding-decoding communication protocol (L. . Moreover, more attention should be paid to the research of multi-scale and dimension conversion so as to provide high-quality data for training diagnostic models based on ML. (2) Increase the capacity of the data. For the problem of small samples with limited data, the methods of under-sampling, over-sampling, cost-sensitive learning and unbalanced learning may be employed to preprocess the data for solving the uneven distribution of data categories. Meanwhile, the flow information, transfer learning, generative adversarial networks (GAN) (Goodfellow et al., 2014;W. Liu, Wang, Tian, et al., 2021) and other methods should be tried to enhance the depth and breadth of the data.
(3) Focus on the developments of data fusion model.
The unified data fusion framework may be established including the general process of data processing, the unified model structure of data fusion space, the fusion results of data normalization and so on. Furthermore, the heterogeneous information should be preprocessed such as the unified representation, conversion and de-heterogeneity. Then the multidimensional information should be calibrated and spatiotemporal matched for maintaining the consistency of multi-sensor data information and ensuring the effects of data fusion. (4) Develop the denoising methods. With the rapid developments of the signal processing, the denoising methods are becoming more and more mature.
A variety of denoising methods may be combined with the purpose of complementing the advantages and disadvantages of each method and achieving a better denoising result. In addition, the antiinterference technologies such as spatial filtering (Wong et al., 2020;D. Wu et al., 2018) and spacetime adaptive filtering (Blunt et al., 2017) may be employed to improve the anti-interference ability of the PME signals. (5) Determine the types of the faults. The law of the effects of different faults on the life degradations of the PME components should be studied, and a hybrid model of component failure and RUL should be established as well. According to the model, the ranges of the faults will be judged and thus the types of the faults can be determined.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported in part by the National Natural Science Foundation of China under Grants 61873058 and 61933007.