Nonlinear dynamic process monitoring using deep dynamic principal component analysis

Data-driven methods have gained popularity in fault detection. Conventional methods rely on single-layer process monitoring, and the information extracted by such methods may not be sufficient to detect some faults in complicated process systems. Inspired by the deep learning concept, a multi-layer fault detection method, namely Deep Principal Component Analysis (DePCA), was previously proposed in the literature. DePCA can extract deep features of a process, resulting in better fault detection performance. However, it assumes that the values of each variable at different moments are unrelated, which is not suitable for complex nonlinear dynamic systems. To address this concern, by adopting dynamic PCA to extract dynamic features, a new deep approach, namely Deep Dynamic Principal Component Analysis (DeDPCA), is proposed. In the new approach, both dynamic and nonlinear features are extracted in different layers so that more process faults can be detected. A Tennessee Eastman process case study was then employed for application and validation of DeDPCA, which indicates that the proposed method is suitable for monitoring complex dynamic nonlinear processes.


Introduction
Since traditional detection systems for chemical processes lack reliability (Yin et al., 2012), data-driven methods such as principal component analysis (PCA) (Wold et al., 1987) have gained popularity, particularly in the era of industrial digitalization. Using data-driven methods (Gao et al., 2013), even if the mechanism and knowledge of the system are not known, one can bypass the complex procedure of establishing a mathematical model of the system and discover faults based on changes in the data.
PCA extracts several orthogonal principal components from the original multi-dimensional data space, thereby simplifying the analysis model. These principal components contain most of the variation of the original data (Wold et al., 1987). PCA was developed on the presumption that the system is linear and static; however, most industrial processes are nonlinear and dynamic. Kramer (1991) proposed a nonlinear principal component analysis (NLPCA) method, which uses a feedforward neural network to represent the feature mapping process such that the network inputs are reproduced at the output layer. Lee et al. (2004) developed another
nonlinear process monitoring method, known as kernel principal component analysis (KPCA), which calculates the principal components in a high-dimensional feature space using a kernel function, such as a Gaussian or polynomial kernel. To extend PCA to dynamic systems, Ku et al. (1995) applied a 'time lag shift' strategy to include the dynamic information, resulting in the so-called dynamic PCA (DPCA). Deep learning (LeCun et al., 2015) is a recently developed machine learning technology that has been used in many fields, such as image identification and natural language processing. It consists of multiple layers of nonlinear mapping (Wu & Zhao, 2018) and aims at learning feature hierarchies in which features at higher levels are formed from lower-level features (Pan et al., 2020). Recently, deep learning has been used in process monitoring and fault detection (Iqbal et al., 2019). Deng et al. (2018) proposed a multilayer and multivariate statistical model, named deep principal component analysis (DePCA), to extract different categories of data features, such as linear and nonlinear principal components. While DePCA is more sensitive to faults than KPCA and serial PCA (SPCA) (Deng et al., 2016), which are commonly used in process monitoring, it cannot extract the dynamic information of a dynamic chemical process.
DePCA, which combines PCA and KPCA, assumes that the values of each variable at different moments are unrelated (Russell et al., 2000). Nowadays, however, most industrial processes are nonlinear and dynamic; considering only nonlinearities without addressing the auto-correlation and cross-correlation in measurement data is not enough.
In this paper, based on an extensive analysis of the pros and cons of different data-driven methods, deep dynamic principal component analysis (DeDPCA), which integrates DPCA and KPCA, was developed. Dynamic feature extraction is conducted by the DPCA layer, while nonlinear features are extracted by the KPCA layer. DPCA explores the temporal autocorrelation of the sampled data by augmenting the data matrix. Considering that extending the time series after kernel mapping may increase computational complexity and lose some dynamic information, DPCA is used to extract dynamic features in the first layer and nonlinear features are extracted in the second layer. Bayesian inference is used to fuse information from the different layers and make the overall decision. Autocorrelation analysis is applied to each variable to decide how much time lag should be reserved.
The paper has several contributions. Firstly, a deep DPCA model combining DPCA and KPCA is proposed, which has the potential to improve process monitoring performance. Secondly, the TE case study demonstrates that DeDPCA is able to detect faults in complex processes. Finally, an approach to optimize the time lag for each variable is proposed to reduce computational complexity while keeping good performance.
The work is organized as follows. Section 2 illustrates the algorithm and model structure of the proposed method. Section 3 presents a case study on the Tennessee Eastman (TE) plant to verify the superiority of the proposed method. Section 4 discusses the outcomes and limitations. Section 5 draws the final conclusions of the work.

DeDPCA algorithm
The DeDPCA method contains two layers of feature mapping, as shown in Figure 1: one is the DPCA dynamic linear feature layer and the other is the KPCA nonlinear feature layer.
In this method, firstly, the normal dataset $\mathbf{Y} \in \mathbb{R}^{n \times m}$, including $n$ samples and $m$ variables, is expanded into an augmented matrix $\mathbf{Y}(l) \in \mathbb{R}^{(n-l) \times m(l+1)}$ with a lag of $l$, as shown in Equation (1). The dynamic linear principal components are then extracted from $\mathbf{Y}(l)$, giving the first-layer features $\mathbf{T}^{(1)} \in \mathbb{R}^{(n-l) \times N_1}$, where $N_1 = m(l+1)$. Applying the KPCA mapping to $\mathbf{T}^{(1)}$ then extracts the nonlinear principal components as the second-layer features $\mathbf{T}^{(2)} \in \mathbb{R}^{(n-l) \times N_2}$, where $N_2$ is the number of nonlinear principal components retained in the second layer.
The DPCA method captures the dynamic information by extending each observation vector with the previous $l$ observations and constructing the data matrix

$$\mathbf{Y}(l) = \begin{bmatrix} \mathbf{y}_t & \mathbf{y}_{t-1} & \cdots & \mathbf{y}_{t-l} \\ \mathbf{y}_{t-1} & \mathbf{y}_{t-2} & \cdots & \mathbf{y}_{t-l-1} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{y}_{t+l-n} & \mathbf{y}_{t+l-n-1} & \cdots & \mathbf{y}_{t-n} \end{bmatrix} \tag{1}$$

where $\mathbf{y}_t$ is the $m$-dimensional observation vector at time $t$. By applying PCA to $\mathbf{Y}(l)$, a multivariate autoregressive model is extracted directly from the data.
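The lag-augmented matrix of Equation (1) can be built with a few lines of NumPy; this is an illustrative sketch (the function name `augment` is ours, not the paper's):

```python
import numpy as np

def augment(Y, l):
    """Build the lag-augmented matrix Y(l) from an (n x m) data matrix.

    Each row stacks the current observation with its previous l observations,
    so the result has shape (n - l) x m(l + 1), matching Equation (1).
    """
    n, m = Y.shape
    rows = []
    for t in range(l, n):
        # [y_t, y_{t-1}, ..., y_{t-l}] flattened into one row
        rows.append(np.concatenate([Y[t - j] for j in range(l + 1)]))
    return np.asarray(rows)

Y = np.random.randn(100, 5)
Yl = augment(Y, 2)
print(Yl.shape)  # (98, 15)
```

Applying ordinary PCA to `Yl` then yields the dynamic linear features of the first layer.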
The optimization task for the first layer can be written as

$$\max_{\mathbf{q}^{(1)}} \ \mathbf{q}^{(1)T} \mathbf{Y}(l)^{T} \mathbf{Y}(l)\, \mathbf{q}^{(1)} \quad \text{s.t.} \ \|\mathbf{q}^{(1)}\| = 1 \tag{2}$$

from which the projection vectors $\mathbf{q}_i^{(1)} \in \mathbb{R}^{m(l+1)}$ can be obtained. The score vectors can then be calculated as $\mathbf{t}_i^{(1)} = \mathbf{Y}(l)\,\mathbf{q}_i^{(1)}$, and the first-layer features are constructed from all these score vectors, $\mathbf{T}^{(1)} = [\mathbf{t}_1^{(1)}, \mathbf{t}_2^{(1)}, \ldots, \mathbf{t}_{N_1}^{(1)}]$. Similarly, with $\Phi(\cdot)$ denoting the nonlinear mapping into the kernel feature space, the optimization task for the second layer can be written as

$$\max_{\mathbf{q}^{(2)}} \ \mathbf{q}^{(2)T} \Phi(\mathbf{T}^{(1)})^{T} \Phi(\mathbf{T}^{(1)})\, \mathbf{q}^{(2)} \quad \text{s.t.} \ \|\mathbf{q}^{(2)}\| = 1 \tag{3}$$

from which the nonlinear projection vector can be expressed as $\mathbf{q}^{(2)} = \sum_{j=1}^{n-l} \beta_j \Phi(\mathbf{t}_j^{(1)})$. Introducing the kernel matrix $\mathbf{K}$ with entries $K_{ij} = \Phi(\mathbf{t}_i^{(1)})^{T}\Phi(\mathbf{t}_j^{(1)})$, Equation (3) can be rewritten as the eigenvalue problem

$$\mathbf{K}\boldsymbol{\beta} = (n-l)\lambda\boldsymbol{\beta} \tag{4}$$

from which the projection vectors $\boldsymbol{\beta}_i \in \mathbb{R}^{n-l}$ can be obtained. The nonlinear score vectors in the second layer can then be calculated as $\mathbf{t}_i^{(2)} = \mathbf{K}\boldsymbol{\beta}_i$, and the second-layer features are constructed from all these score vectors, $\mathbf{T}^{(2)} = [\mathbf{t}_1^{(2)}, \mathbf{t}_2^{(2)}, \ldots, \mathbf{t}_{N_2}^{(2)}]$.

For a testing vector $\tilde{\mathbf{y}}_t = [\mathbf{y}_t \ \mathbf{y}_{t-1} \ \cdots \ \mathbf{y}_{t-l}]$, the first-layer feature vector is $\mathbf{f}_t^{(1)} = \tilde{\mathbf{y}}_t \mathbf{Q}^{(1)}$, where $\mathbf{Q}^{(1)} = [\mathbf{q}_1^{(1)}, \ldots, \mathbf{q}_{N_1}^{(1)}]$. The second-layer feature vector $\mathbf{f}_t^{(2)}$ is then obtained from the kernel vector $\tilde{\mathbf{k}}_t$ between $\mathbf{f}_t^{(1)}$ and the training scores, $\mathbf{f}_t^{(2)} = \tilde{\mathbf{k}}_t^{T}\,[\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_{N_2}]$. As can be seen, feature extraction from layer to layer does not require complex network optimization, which makes the deep model much easier to train compared with other deep learning methods such as convolutional neural networks (CNN) and long short-term memory (LSTM) networks.
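The two-layer mapping described above can be sketched with off-the-shelf PCA and kernel PCA implementations. This is a minimal sketch, not the paper's code: the function name `fit_dedpca`, the 90%-variance rule for layer 1, and the kernel width `gamma` are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

def fit_dedpca(Y_aug, n2=10, gamma=1e-3):
    """Two-layer feature extraction: linear PCA on the lag-augmented data
    (DPCA layer), then Gaussian-kernel KPCA on the layer-1 scores."""
    pca = PCA(n_components=0.90).fit(Y_aug)   # layer 1: dynamic linear features
    T1 = pca.transform(Y_aug)
    kpca = KernelPCA(n_components=n2, kernel="rbf", gamma=gamma).fit(T1)
    T2 = kpca.transform(T1)                   # layer 2: nonlinear features
    return pca, kpca, T1, T2

Y_aug = np.random.randn(200, 15)              # stand-in for Y(l)
pca, kpca, T1, T2 = fit_dedpca(Y_aug)
print(T1.shape[0], T2.shape)                  # 200 samples in both layers
```

Because each layer is a closed-form eigen-decomposition, no iterative network training is needed, which is the point made in the text above.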
After the features of each layer are obtained, their monitoring metrics can be computed. In the first layer, the $T^{2(1)}$ and $Q^{(1)}$ metrics are calculated as

$$T^{2(1)} = \mathbf{f}_t^{(1)} \boldsymbol{\Lambda}_1^{-1} \mathbf{f}_t^{(1)T}, \qquad Q^{(1)} = \|\tilde{\mathbf{y}}_t - \hat{\tilde{\mathbf{y}}}_t\|^2$$

where $\boldsymbol{\Lambda}_1$ is the diagonal matrix of eigenvalues retained in the first layer and $\hat{\tilde{\mathbf{y}}}_t$ is the reconstruction of $\tilde{\mathbf{y}}_t$.

In the second layer, the $T^{2(2)}$ and $Q^{(2)}$ metrics are calculated analogously as

$$T^{2(2)} = \mathbf{f}_t^{(2)} \boldsymbol{\Lambda}_2^{-1} \mathbf{f}_t^{(2)T}, \qquad Q^{(2)} = \|\Phi(\mathbf{f}_t^{(1)}) - \hat{\Phi}(\mathbf{f}_t^{(1)})\|^2$$

where $\boldsymbol{\Lambda}_2 \in \mathbb{R}^{N_2 \times N_2}$ is a diagonal matrix whose diagonal elements are the eigenvalues retained in the second layer.
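As a sketch of the first-layer statistics, the $T^2$ value weights the scores by the inverse retained eigenvalues and $Q$ is the squared reconstruction residual; the helper name `t2_q_layer1` and the random data are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

def t2_q_layer1(pca, y_row):
    """First-layer T^2 and Q for one standardized, lag-expanded sample."""
    f = pca.transform(y_row[None, :])[0]           # first-layer feature vector f_t
    lam = pca.explained_variance_                  # eigenvalues of retained PCs
    t2 = float(np.sum(f**2 / lam))                 # Hotelling's T^2
    y_hat = pca.inverse_transform(f[None, :])[0]   # reconstruction of y_tilde
    q = float(np.sum((y_row - y_hat)**2))          # Q (SPE): squared residual
    return t2, q

rng = np.random.default_rng(0)
Y_aug = rng.standard_normal((300, 12))             # stand-in for training Y(l)
pca = PCA(n_components=0.90).fit(Y_aug)
t2, q = t2_q_layer1(pca, Y_aug[0])
print(t2 >= 0.0 and q >= 0.0)  # True
```

The second-layer statistics follow the same pattern using the kernel scores and the eigenvalues of the kernel matrix.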

Time lag selection
Expanding the time series takes the dynamic information into consideration, but it also increases the computational complexity. Since the autocorrelation of each variable is different, expanding all the variables by the same time lag is unnecessary. To account for the autocorrelation of each variable, a separate analysis can be used to choose an appropriate lag $l$ per variable.
The autocorrelation of a variable is denoted as $\gamma$, and a strong-autocorrelation threshold is denoted as $\gamma_{\min}$. If $\gamma \geq \gamma_{\min}$, the variable has strong autocorrelation; otherwise it has weak autocorrelation. Here $\gamma_{\min}$ is set to 0.5 empirically.
When deciding the time lag of a variable $y_1$, $l$ is first set to 0. If $\gamma(l) \geq \gamma_{\min}$ and $\gamma(l+1) < \gamma_{\min}$, then $[y_{1,t-1}\ y_{1,t-2}\ \ldots\ y_{1,t-l}]$ together form the expanded sample vector at time $t$, where $l \leq l_{\max}$; otherwise $l := l + 1$ and the above process is repeated. The final $l$ is the time lag for variable $y_1$, and $l_{\max}$ is the maximum time lag, since the timing expansion must remain finite-dimensional.
Using this method, the time lags for variables with strong autocorrelation are large and with weak autocorrelation are small. Hence the dynamic information of the process can be kept as much as possible while reducing the number of expanded variables, thus ultimately reducing the computational complexity.
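The per-variable lag selection above can be sketched as follows; this is an illustrative implementation (the function name `select_lag` and the sample-autocorrelation estimator are our choices), assuming the 0.5 threshold and a cap `l_max` from the text:

```python
import numpy as np

def select_lag(x, gamma_min=0.5, l_max=10):
    """Increase the lag while the autocorrelation at that lag stays
    above gamma_min, capped at l_max."""
    x = x - x.mean()
    denom = np.dot(x, x)
    l = 0
    while l < l_max:
        g = np.dot(x[:-(l + 1)], x[l + 1:]) / denom  # autocorrelation at lag l+1
        if g < gamma_min:
            break
        l += 1
    return l

# A slowly varying (strongly autocorrelated) signal gets a large lag,
# while a rapidly alternating signal gets lag 0.
t = np.linspace(0, 1, 500)
print(select_lag(np.sin(2 * np.pi * t)))     # 10 (hits l_max)
print(select_lag(np.tile([1.0, -1.0], 250))) # 0
```

Each variable is then expanded only up to its own lag, keeping most of the dynamic information while limiting the number of expanded columns.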

Control limit calculation
Since the Gaussian assumption is not applicable to most industrial processes (Wang et al., 2016), the control limits can be calculated using the actual probability density functions. Kernel density estimation (KDE) (Odiowei & Cao, 2009) has been widely used for estimating probability density functions (PDFs), particularly for univariate stochastic variables; hence it can be used to estimate the PDFs of the $T^2$ and $Q$ metrics.
The PDF $\hat{p}(x)$ of a variable at point $x$ can be estimated using a kernel function $K(\cdot)$ as

$$\hat{p}(x) = \frac{1}{kh} \sum_{j=1}^{k} K\!\left(\frac{x - x_j}{h}\right)$$

where $x_j$ is the $j$-th sample of $x$, $k$ is the number of samples, and $h$ is a smoothing parameter called the bandwidth.
Given a confidence level $\alpha$, the control limits $T^2_{\lim}$ and $Q_{\lim}$ can be calculated from the estimated PDF as the points satisfying Equation (10):

$$\int_{-\infty}^{T^2_{\lim}} \hat{p}(x)\,dx = \alpha, \qquad \int_{-\infty}^{Q_{\lim}} \hat{p}(x)\,dx = \alpha. \tag{10}$$
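A KDE-based control limit can be sketched with SciPy's Gaussian KDE; the grid-based numerical inversion of the CDF below is one simple way to solve for the limit, and the chi-squared training data is a stand-in for real normal-operation $T^2$ values:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_control_limit(stats, alpha=0.95):
    """Estimate the PDF of normal-operation statistics and return the
    point below which a fraction alpha of the probability mass lies."""
    kde = gaussian_kde(stats)                        # Gaussian kernel, auto bandwidth
    grid = np.linspace(stats.min(), stats.max() * 2, 4000)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]                                   # normalize the numerical CDF
    return float(grid[np.searchsorted(cdf, alpha)])

rng = np.random.default_rng(0)
t2_normal = rng.chisquare(df=5, size=1000)           # stand-in for training T^2 values
print(kde_control_limit(t2_normal))                  # close to the chi^2(5) 95% point (~11.07)
```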

Fault information fusion
In the DeDPCA model, there are two monitoring statistics, $T^{2(s)}$ and $Q^{(s)}$, in each layer, where $s = 1, 2$. To fuse the information of the different layers, Bayesian inference is adopted. Firstly, Bayesian inference (Woolrich et al., 2004) is used to convert each monitoring statistic into a posterior fault probability. Then, these probabilities are weighted and summed. Finally, the probability-based combined monitoring statistics (Cai et al., 2017) $P(T^2)$ and $P(Q)$ are determined.
For the monitoring statistics $T^{2(s)}$ and $Q^{(s)}$, the fault probabilities of sample $\mathbf{y}_t$ under the fault condition $C_f$, $P^{(s)}_{T^2}(\mathbf{y}_t|C_f)$ and $P^{(s)}_{Q}(\mathbf{y}_t|C_f)$, can be respectively defined as

$$P^{(s)}_{T^2}(\mathbf{y}_t|C_f) = \exp\!\left(-\kappa\,\frac{T^{2(s)}_{\lim}}{T^{2(s)}}\right), \qquad P^{(s)}_{Q}(\mathbf{y}_t|C_f) = \exp\!\left(-\kappa\,\frac{Q^{(s)}_{\lim}}{Q^{(s)}}\right)$$

Similarly, the probabilities of sample $\mathbf{y}_t$ under the normal condition $C_n$ are

$$P^{(s)}_{T^2}(\mathbf{y}_t|C_n) = \exp\!\left(-\kappa\,\frac{T^{2(s)}}{T^{2(s)}_{\lim}}\right), \qquad P^{(s)}_{Q}(\mathbf{y}_t|C_n) = \exp\!\left(-\kappa\,\frac{Q^{(s)}}{Q^{(s)}_{\lim}}\right)$$

where $\kappa$ is a parameter used to reduce sensitivity to abnormal data; it was set to 0.1. Using Bayesian inference theory, $T^{2(s)}$ and $Q^{(s)}$ can be converted into the posterior probabilities $P^{(s)}_{T^2}(C_f|\mathbf{y}_t)$ and $P^{(s)}_{Q}(C_f|\mathbf{y}_t)$:

$$P^{(s)}_{T^2}(C_f|\mathbf{y}_t) = \frac{P^{(s)}_{T^2}(\mathbf{y}_t|C_f)\,P^{(s)}_{T^2}(C_f)}{P^{(s)}_{T^2}(\mathbf{y}_t|C_f)\,P^{(s)}_{T^2}(C_f) + P^{(s)}_{T^2}(\mathbf{y}_t|C_n)\,P^{(s)}_{T^2}(C_n)}$$

and analogously for $P^{(s)}_{Q}(C_f|\mathbf{y}_t)$, where $P^{(s)}_{T^2}(C_f)$ and $P^{(s)}_{Q}(C_f)$ are the prior fault probabilities, set to the significance level $\delta$, and $P^{(s)}_{T^2}(C_n)$ and $P^{(s)}_{Q}(C_n)$ are the prior normal probabilities, set to $1 - \delta$.
Giving weights to $P^{(s)}_{T^2}(C_f|\mathbf{y}_t)$ and $P^{(s)}_{Q}(C_f|\mathbf{y}_t)$, two overall monitoring statistics $P(T^2)$ and $P(Q)$ are constructed:

$$P(T^2) = \sum_{s=1}^{2} \omega^{(s)}_{T^2}\, P^{(s)}_{T^2}(C_f|\mathbf{y}_t), \qquad P(Q) = \sum_{s=1}^{2} \omega^{(s)}_{Q}\, P^{(s)}_{Q}(C_f|\mathbf{y}_t)$$

where the weights $\omega^{(s)}_{T^2}$ and $\omega^{(s)}_{Q}$ sum to one, and $0 < \varepsilon < 1$ is a small number used in computing the weights; $\varepsilon$ is set to 0.01. $P(T^2)$ and $P(Q)$ fuse the results of the different feature layers to indicate the process operation status. There is no fault if $P(T^2) < \delta$ and $P(Q) < \delta$; otherwise a fault has occurred.
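The conversion-and-fusion step can be sketched for one statistic across two layers. This is a hedged sketch: the exponential likelihood forms are the common construction used with this kind of Bayesian fusion, and the exact placement of $\kappa$ in the paper's equations is an assumption here, as are the fixed equal weights.

```python
import numpy as np

def posterior_fault_prob(stat, limit, delta=0.05, kappa=0.1):
    """Posterior fault probability for one monitoring statistic in one layer."""
    p_f = np.exp(-kappa * limit / max(stat, 1e-12))  # likelihood under fault C_f
    p_n = np.exp(-kappa * stat / limit)              # likelihood under normal C_n
    num = p_f * delta                                # prior P(C_f) = delta
    return float(num / (num + p_n * (1.0 - delta)))  # prior P(C_n) = 1 - delta

def fuse(stats, limits, weights):
    """Weighted sum of per-layer posterior fault probabilities."""
    probs = [posterior_fault_prob(s, l) for s, l in zip(stats, limits)]
    return float(np.dot(weights, probs))

# Statistics far above their limits push the fused probability above
# the significance level delta = 0.05, signalling a fault.
print(fuse([30.0, 25.0], [10.0, 12.0], [0.5, 0.5]) > 0.05)  # True
print(fuse([5.0, 6.0], [10.0, 12.0], [0.5, 0.5]) < 0.05)    # True
```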

DeDPCA process monitoring procedure
The monitoring process can be divided into two parts: offline modelling and online monitoring. In the first part, normal operation data are collected as the training dataset, the DeDPCA mapping model is established using this dataset, and the control limits are computed. In the second part, each online sample is collected and mapped into the DeDPCA model to determine whether it is normal or faulty. The main algorithm flowchart is shown in Figure 2.
Part 1: Offline modelling
(1) Collect the samples under normal conditions and standardize them.
(2) Build the DeDPCA model using the training data.
(3) Map the training data into the DeDPCA space and calculate the monitoring statistics $T^{2(s)}$ and $Q^{(s)}$ for each layer, $s = 1, 2$.
(4) Compute the control limits $T^{2(s)}_{\lim}$ and $Q^{(s)}_{\lim}$ through the KDE method, $s = 1, 2$.
Part 2: Online monitoring
(1) For a new sample $\mathbf{y}_t$, normalize it with the mean and variance of the normal training dataset and construct the expansion vector $\tilde{\mathbf{y}}_t$.
(2) Map $\tilde{\mathbf{y}}_t$ into the DeDPCA space and calculate the monitoring statistics for each layer.
(3) Compute the overall probability-based monitoring statistics $P(T^2)$ and $P(Q)$.
(4) Determine whether there is a fault by comparing $P(T^2)$ and $P(Q)$ with their corresponding significance levels.
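The online-monitoring loop of Part 2 can be sketched as follows; `compute_prob` is a stand-in for the mapping and fusion steps (here faked with a simple function for demonstration), and the buffering logic shows how the lag-expanded vector is formed from the sample stream:

```python
import numpy as np

def monitor(stream, mu, sigma, l, compute_prob, delta=0.05):
    """Yield a fault flag per sample once l past samples are available."""
    buf = []
    for y in stream:
        buf.append((y - mu) / sigma)                    # step (1): standardize
        if len(buf) < l + 1:
            continue                                    # wait for l past samples
        y_tilde = np.concatenate(buf[-(l + 1):][::-1])  # [y_t, y_{t-1}, ..., y_{t-l}]
        p_t2, p_q = compute_prob(y_tilde)               # steps (2)-(3)
        yield bool(p_t2 > delta or p_q > delta)         # step (4): fault decision

mu, sigma = np.zeros(3), np.ones(3)
stream = [np.zeros(3)] * 5 + [np.full(3, 8.0)] * 3      # fault begins at sample 6
flags = list(monitor(stream, mu, sigma, 2,
                     lambda v: (np.tanh(np.abs(v).mean()),) * 2))
print(flags)  # [False, False, False, True, True, True]
```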

Case study
The proposed DeDPCA method was then compared with the KPCA, DPCA and DePCA methods through a case study on the Tennessee Eastman process. DePCA combines PCA and KPCA, while DeDPCA combines DPCA and KPCA to construct the deep model. To confirm the merits of the proposed DeDPCA method, it was first compared with DePCA to examine the importance of deep mining. Then the feature extraction methods in the different layers were exchanged to determine which structure is better. Finally, the improved DeDPCA was compared with DeDPCA to show the benefits of extending the time series per variable.

TE process
The Tennessee Eastman process (e.g. Lyman & Lau et al., 2013) is a benchmark process whose flowchart is illustrated in Figure 3. The TE process consists of five main units, namely a reactor, condenser, compressor, stripper and separator, and eight material components, represented by A to H. In process modelling, 52 variables are selected as monitored variables, which include 11 manipulated variables, 22 measured variables, and 19 composition measurements. Training and testing datasets were collected from the TE simulation over 48 h of operation, with the data of each variable sampled every 3 min. In the simulation, the fault is introduced at 8 h; each dataset thus contains 960 samples, and faults occur at the 161st sample. The related simulation results can be obtained from Braatz (2009).

Implementation details
The Gaussian kernel function was chosen for the nonlinear mapping, with a kernel width of σ = 100n based on tests. For DeDPCA, the lag l is set to 2 according to Ku's theory (Ku et al., 1995). The number of retained principal components is chosen so that over 90% of the total sum of eigenvalues is kept, preserving most of the information. Hotelling's $T^2$ and SPE ($Q$) statistics were used together to represent the current system status. Kernel density estimation was adopted to calculate the control limits and determine the boundary between normal and faulty for each layer and statistic. The Bayesian-inference-based fusion method was used to combine all layers of statistical information and draw the final decision. To reduce the influence of noise, a fault is declared in the deep methods only when the monitoring statistics of six consecutive samples exceed the control limit; this threshold was set empirically.
All data were standardized first to ensure each variable contributes equally when reducing dimensionality. Standardization involves two steps: first, subtract the sample mean of each variable; then divide by its standard deviation, scaling each variable to unit variance.
Three indices were used to assess the performance of the fault detection methods. The first is the fault detection rate (FDR), the ratio of faulty samples that a method correctly detects as faulty. The second is the fault detection time (FDT), the time a method needs to first detect a fault after it occurs. The last is the false alarm rate (FAR), the ratio of normal samples that a method wrongly flags as faulty. The FDR reflects the accuracy of a detection method, the FDT its sensitivity and promptness, and the FAR its robustness. Therefore, one fault detection method is more reliable than another if its FDR is numerically higher and its FDT and FAR are numerically lower.
Table 1 shows the fault detection results of the 21 TE faults using the DPCA, KPCA, DePCA and DeDPCA methods. To facilitate comparison, the 21 TE faults are divided into three parts. The first part includes F1, F2, F4, F6, F7, F8, F12, F13, F14, F17 and F18, which can be easily detected: the FDRs of these faults using KPCA and DPCA are all greater than 90%. The second part includes F5, F10, F11, F16, F19, F20 and F21; these faults can be detected most of the time using KPCA and DPCA. The third part includes F3, F9 and F15, the faults most difficult to detect in the TE process, because there is essentially no difference between their faulty samples and normal samples.
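The three indices can be computed directly from a per-sample alarm sequence and the known fault onset; this sketch uses our own function name and 0-indexing (the 161st sample becomes index 160), and here reports FDT as a sample count rather than wall-clock time:

```python
import numpy as np

def detection_indices(alarms, fault_start):
    """FDR, FDT (in samples) and FAR from a boolean alarm sequence."""
    alarms = np.asarray(alarms, dtype=bool)
    faulty, normal = alarms[fault_start:], alarms[:fault_start]
    fdr = float(faulty.mean())                     # fraction of faulty samples flagged
    far = float(normal.mean())                     # fraction of normal samples flagged
    hits = np.flatnonzero(faulty)
    fdt = int(hits[0]) if hits.size else None      # samples until first detection
    return fdr, fdt, far

# 960-sample run: one spurious alarm during normal operation, then every
# faulty sample detected immediately.
alarms = [False] * 158 + [True, False] + [True] * 800
print(detection_indices(alarms, 160))  # (1.0, 0, 0.00625)
```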

Results and analysis
To appreciate the superiority of multi-layer feature extraction, the single-layer methods DPCA and KPCA are first compared with the multi-layer methods DePCA and DeDPCA. For the first part of faults, all four fault detection methods detect them easily, with FDRs close to 100%; but DePCA and DeDPCA detect the faults earlier than DPCA and KPCA, and also have lower false alarm rates. As the confidence level was set to 95%, the ideal FAR is less than 5%, yet the average FAR of DPCA is over 5%. For the second part of faults, some faults that are not easily detected by DPCA or KPCA can easily be detected by DePCA and DeDPCA, especially F5, F10, F16, F19 and F20. For the third part of faults, all four methods can hardly detect them. Overall, the multi-layer methods detect more faults, find faults earlier and cause fewer false alarms.
The largest fault detection rate for each fault is shown in bold. Comparing DePCA and DeDPCA, DeDPCA gets the larger FDRs for most faults except F21, F3 and F9, for which the FDRs of the two methods are very similar. It can be concluded that DeDPCA is more sensitive to faults in terms of FDR. For fault detection time, most faults in the first part can be detected within 10 samples, and the mean FDTs of DePCA and DeDPCA differ by only one sample point. In the second part, DeDPCA detects faults 4 samples earlier than DePCA on average; the most obvious difference is F20, which needs 65 samples to detect using DePCA but only 17 samples using DeDPCA. In the third part, DePCA and DeDPCA perform the same on faults F3 and F9; however, DeDPCA greatly reduces the time needed to detect F15. The false alarm rates of the two methods are similar and both below 5% on average. Thus DeDPCA achieves a larger average FDR and a smaller average FDT, indicating that it is a more efficient method for fault detection. These results validate the benefits of extending the time series for fault detection, rather than ignoring timing dependencies.
To verify that the current deep model achieves better performance, we designed another deep model that also uses DPCA for dynamic feature extraction but places it in the second layer, after KPCA; the proposed structure is denoted DeDPCA1 and the alternative DeDPCA2, with the results given in Table 2. It is apparent that the average FDR of DeDPCA1 is more than 4% larger than that of DeDPCA2. Although the average FDT of DeDPCA1 is later than that of DeDPCA2, DeDPCA1 causes fewer false alarms. Therefore, extracting the dynamic features in the first layer appears more reasonable; extracting them in the second layer may lose some dynamic information due to the nonlinear transformation and the time-independence assumption of the first KPCA layer.
The effectiveness of the proposed model has been verified above. However, because a time lag shift strategy is used, the number of variables increases, which costs more computation time. To address this, we changed the way of time lag selection and called the new method improved DeDPCA. Table 3 shows the average fault detection results and operation time of the 21 TE faults using the former DeDPCA and the improved DeDPCA. The computation time for a sample was recorded 10 times for comparison.

Summary of outcomes
A new hierarchical statistical model structure, namely DeDPCA, was designed based on the DPCA and KPCA algorithms. The proposed method contains two layers, namely DPCA and KPCA: the DPCA layer extracts the linear dynamic features, while the KPCA layer accounts for the nonlinearities of the process. Bayesian inference was adopted to combine the features of the different layers and make the overall decision. The proposed method has been applied to the benchmark TE process for its application and validation. The monitoring performance of the proposed DeDPCA was compared with that of KPCA, DPCA and DePCA, using FDR, FDT and FAR as evaluation indices. The results show that DeDPCA attained a higher FDR, an earlier FDT and a smaller FAR than DePCA. Besides, the necessity of extracting the dynamic features in the first layer was verified, and a new timing extension technique was used to reduce the computational complexity.

Limitations of the work
A limitation of this work may be that it is difficult to compute the contribution of each variable to the deep-layer monitoring statistics, since these statistics are calculated after several mappings from the original variables.
Another limitation is that DPCA is the simplest and most straightforward method for dynamic process monitoring. Although DPCA can detect faults, the inclusion of lagged variables makes the diagnosis of abnormal behaviour more complicated (Treasure et al., 2004). Besides, Negiz and Çinar (1998) noted that the principal components extracted by this method are not exactly the minimal dynamic representations, which would also influence the nonlinear feature extraction in the second layer.

Further research
Future work could address determining the number of feature layers for data mining and how to arrange the relations between different layers. Moreover, determining the variables causing a fault using a contribution map may be another line of potential work.

Conclusions
Conventional fault detection methods normally rely on a single layer of feature extraction, which cannot extract sufficient information. Although deep fault detection methods have been proposed recently, they simply extract deep linear and nonlinear features and implicitly assume that the variables are time-independent. However, most industrial processes are nonlinear and dynamic. This paper proposes a solid integration of deep fault detection and dynamic methods, providing a novel way to monitor the safety of complex nonlinear dynamic systems.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported by the Institute of Zhejiang University-Quzhou Science and Technology Project (IZQ2019-KJ-021).