Batch process control and monitoring: a Dual STATIS and Parallel Coordinates (DS-PC) approach

ABSTRACT Multivariate data collected from batches are usually monitored via control charts (CCs) based on MPCA and MPLS for batch-to-batch comparison. In addition, distribution-free approaches include other dimensionality reduction methods for batch- and time-wise analysis. However, techniques for multivariate data focused on variable-wise analysis have not been widely developed. Here, we propose a nonparametric quality control strategy for off-line monitoring of batches and variables, together with visual clustering of observations within batches. In our approach, CCs based on Dual STATIS are created using robust bagplots to enhance signal detection in batch- and variable-wise analysis, while parallel coordinate plots are used to identify unusual behavior of observations per variable, regardless of distributional assumptions. The main advantage of the proposed strategy is the detection of different types of changes through meaningful visualization tools, allowing easier interpretation of results in industrial settings.


Introduction
Nowadays, several industries rely on batch processing, including, for example, manufacturers of chemicals, pharmaceuticals, food products, fabrics, metals, baked goods, pulp, and paper. The batch is the primary processing unit: raw materials are loaded into it, and a series of transformations then take place to yield final products. High-quality products are commonly described by quality characteristics (variables), each of which must be controlled within specifications to maintain customer satisfaction and to describe process performance as the batch progresses (Lewis, 2014; Niang, Fogliatto, & Saporta, 2013).
Statistical process control (SPC) for monitoring industrial processes often involves simultaneous monitoring or control of two or more correlated quality-process characteristics. In this sense, multivariate control charts (CCs) replace the univariate ones. Parallel coordinate plotting has also been applied to distinguish normal and abnormal operating conditions in hazard identification (Comfort, Warner, Vargo, & Bass, 2011). Its latest contribution in process monitoring was presented in combination with the PCA technique, which delivers visualization of confidence regions, evaluation of models with different numbers of principal components, comparison of faulty events, and determination of false alarm frequency in actual process data (Dunia, Edgar, & Nixon, 2012).
In this paper, we propose a nonparametric quality control strategy based on IS and CO CCs which enables off-line monitoring of batch processes using Dual STATIS and bagplots for establishing control regions. A complementary analysis is developed with parallel coordinate plotting to examine tendencies within out-of-control batches. Our work extends recent approaches where the use of the STATIS method has focused on time-wise analysis. In this sense, our combined strategy brings on a variable-wise analysis, leading to support the visual interpretation of out-of-control signals. Then, industries will be able to determine if quality specifications are met, resulting in cost and time savings.
The remaining sections are organized as follows. A brief review of STATIS, Dual STATIS and Parallel Coordinates is provided in Sections 2.1 and 2.2. The proposed DS-PC strategy is detailed in Section 2.3. Application of the strategy on two case studies is described in Sections 3.1 and 3.2. Finally, discussion and conclusions are delivered in Sections 4 and 5.

STATIS & Dual STATIS
STATIS, an acronym which stands for the French expression 'Structuration des Tableaux À Trois Indices de la Statistique', was initially proposed by L'Hermier des Plantes (1976) and later described by Escoufier (1987) and Lavit, Escoufier, Sabatier, and Traissac (1994). This multivariate data analysis method is aimed at exploring and analyzing the structure of several data tables obtained under different scenarios. The multiple data tables are sets of variables' measurements on the same observations. The method reduces the dimensionality of three-way data arrays through a similarity measure based on Euclidean distances between points' configurations. A recent review on this method and its extensions is available in Abdi, Williams, Valentin, and Bennani-Dosse (2012).
In contrast to STATIS, Dual STATIS handles data in which the individuals vary from one scenario to another while the variables remain the same. Dimensionality reduction, performed via EVD, SVD, and GSVD, allows for the visualization of the salient features of tables and variables. Data consist of K sets of observations (tables) measured on the same set of variables. Here, instead of computing K cross-product matrices between the observations, we compute K covariance matrices between the variables (one per set of observations). The Dual STATIS approach then follows the same steps as standard STATIS and provides a compromise map for the variables instead of the observations (Abdi et al., 2012).
Centering, scaling, and normalizing are typically based on per-table statistics. Nonetheless, for quality control purposes, centering and scaling are performed with the global mean and standard deviation of the variables over all K tables, thus capturing the in-control behavior. This is possible because all variables are shared among the tables. Since the tables have different numbers of rows, they are normalized by dividing each table by its number of rows.
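Although the paper's computations were carried out in R, this preprocessing step can be sketched in Python/NumPy as follows (the function name is ours; it standardizes with the global reference statistics and then divides each table by its own row count):

```python
import numpy as np

def preprocess_tables(tables):
    """Center and scale every table with the GLOBAL mean and standard
    deviation computed over all K reference batches, then normalize
    each table by dividing it by its own number of rows."""
    stacked = np.vstack(tables)            # all tables share the J variables
    mu = stacked.mean(axis=0)              # global per-variable mean
    sigma = stacked.std(axis=0, ddof=1)    # global per-variable std
    return [(X - mu) / sigma / X.shape[0] for X in tables]
```

Multiplying a preprocessed table back by its row count recovers the globally standardized data, which is what makes the later cross-products comparable across batches.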

First step: interstructure analysis

In this way, K preprocessed matrices X_1, ..., X_K with dimension I_k × J can be obtained. These matrices are vertically stacked to structure the matrix X with I rows and J columns, where I = I_1 + ... + I_K. X exhibits horizontal partitioning, which makes sense for Dual STATIS cases. For each table X_k, a J × J cross-product matrix is obtained:

S_k = X_k^T X_k

Each symmetric cross-product matrix S_k is vectorized (converted into a vector by placing its rows one after another). Then, the obtained vectors are stored in a new matrix Z of dimension K × J^2.
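This cross-product-and-vectorization step can be sketched as follows (a Python/NumPy illustration; the function name is ours):

```python
import numpy as np

def vectorized_cross_products(tables):
    """Compute S_k = X_k^T X_k for each preprocessed table, vectorize
    each S_k row by row, and stack the vectors into Z (K x J^2)."""
    return np.array([(X.T @ X).ravel() for X in tables])
```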
The interstructure matrix C of dimension K × K is computed as the matrix product C = Z Z^T. This positive semidefinite matrix allows each of the K tables to be represented in two dimensions using the Eigenvalue Decomposition (EVD), with eigenvalues sorted from largest to smallest:

C = U Θ U^T

where U is a matrix containing the eigenvectors of C and Θ is a diagonal matrix collecting the corresponding eigenvalues. An alternative way of computing U is to perform a Singular Value Decomposition (SVD) of the matrix Z:

Z = U Γ V^T

Let L be the rank of the matrix Z; U is a K × L matrix containing the left singular vectors of Z, Γ an L × L diagonal matrix collecting the singular values of Z, and V a J^2 × L matrix storing the right singular vectors of Z. The EVD of C allows tables to be represented as points in a two-dimensional PCA map called the Interstructure Space (IS), using the first and second columns of the K × L matrix

G = U Γ

as coordinates. This particular representation, also called a biplot, was first introduced by Gabriel (1971) and later modified by Galindo (1986). A biplot allows information on both the individuals and the variables of a data matrix to be displayed graphically. The IS CC is based on this plot.
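Under our reading of the text, the IS coordinates are the first two columns of G = U Γ (equivalently G = Z V). A minimal NumPy sketch, with function name ours:

```python
import numpy as np

def interstructure_scores(Z):
    """EVD of C = Z Z^T; IS coordinates are the first two columns of
    G = U Gamma, where Gamma = Theta^(1/2)."""
    C = Z @ Z.T
    theta, U = np.linalg.eigh(C)              # eigh returns ascending order
    order = np.argsort(theta)[::-1]           # sort largest to smallest
    theta, U = theta[order], U[:, order]
    G = U * np.sqrt(np.clip(theta, 0.0, None))
    return G[:, :2], theta
```

Up to the usual sign indeterminacy of eigenvectors, this matches the coordinates obtained directly from the SVD of Z.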
A new preprocessed table X_{K+1} can be projected onto the interstructure space by computing a supplementary cross-product matrix S_{K+1}, vectorizing it, and generating the corresponding matrix Z_{K+1} of dimension 1 × J^2. Then, the associated point in the interstructure space is determined using the first and second columns of the 1 × L matrix

G_{K+1} = Z_{K+1} V

as coordinates. From the interstructure space, an optimal set of weights for the tables is derived to compute the best common representation of the J variables (Abdi et al., 2012). These weights are determined by rescaling the first eigenvector u_1 of C, dividing it by the sum of its elements, and defining the vector m as:

m = u_1 / (1^T u_1)

Then, the diagonal matrix M of optimum weights is computed as follows: the m_i values from m are stacked, and each value is repeated I_k times according to the number of observations in each table. Note that M is a square matrix of order I = I_1 + ... + I_K.
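The weight computation and the supplementary projection can be sketched as follows (Python/NumPy; function names are ours, and the sign flip reflects the fact that an eigenvector is only defined up to sign):

```python
import numpy as np

def table_weights(C):
    """m = u_1 / sum(u_1): first eigenvector of C rescaled so that the
    table weights sum to one (entries of C are nonnegative here, so the
    Perron eigenvector has entries of a single sign)."""
    theta, U = np.linalg.eigh(C)
    u1 = U[:, np.argmax(theta)]
    if u1.sum() < 0:                          # eigenvector sign is arbitrary
        u1 = -u1
    return u1 / u1.sum()

def project_supplementary(Z, z_new):
    """IS coordinates of a supplementary vectorized table:
    G_new = z_new V, with V the right singular vectors of Z."""
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return (z_new @ Vt.T)[:2]
```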

Second step: intrastructure analysis

A mass, denoted α_j, is assigned to each variable. Masses are nonnegative elements whose sum equals one. Although different masses can be computed for each variable, equal masses are often chosen to ensure that all variables have the same importance in the analysis (Abdi et al., 2012). A diagonal matrix A of variable masses is obtained by dividing the identity matrix I_{J×J} by the number of variables J:

A = (1/J) I_{J×J}

The triplet (X, M, A) is now formed by the preprocessed tables, the weights, and the masses. Using the generalized singular value decomposition (GSVD) of the triplet, the following decomposition is computed:

X = P Δ Q^T, with P^T M P = I and Q^T A Q = I

Let T be the rank of the matrix X; P is an I × T matrix containing the left generalized singular vectors of X, Δ a T × T diagonal matrix collecting its singular values, and Q a J × T matrix storing the right generalized singular vectors. This decomposition provides a Compromise Space for the representation of the variables in a reduced-dimension map, which constitutes the basis for the CO CCs.
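For diagonal metrics M and A, the GSVD of the triplet can be obtained from an ordinary SVD of the reweighted matrix M^(1/2) X A^(1/2), a standard computational device (the function name is ours):

```python
import numpy as np

def gsvd(X, M_diag, A_diag):
    """GSVD of the triplet (X, M, A) with diagonal metrics, via a plain
    SVD of Xw = M^(1/2) X A^(1/2); returns P, delta, Q such that
    X = P diag(delta) Q^T with P^T M P = I and Q^T A Q = I."""
    Xw = np.sqrt(M_diag)[:, None] * X * np.sqrt(A_diag)[None, :]
    Pw, delta, Qwt = np.linalg.svd(Xw, full_matrices=False)
    P = Pw / np.sqrt(M_diag)[:, None]       # undo the row metric
    Q = Qwt.T / np.sqrt(A_diag)[:, None]    # undo the column metric
    return P, delta, Q
```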
Compromise Factor Scores allow the J variables to be plotted in the compromise map. These scores are computed as the first and second columns of the J × T matrix

F = Q Δ

Recall that partial factor scores can be defined as the projection of the k-th table onto its left singular vectors (Abdi et al., 2012). The partial factor scores for this table are computed as the first and second columns of the matrix

F_k = X_k^T P_k

The compromise factor scores matrix F is also the weighted average of the partial factor scores, F = Σ_k m_k F_k, and is considered as their barycenter (Abdi et al., 2012). Therefore, for computing the partial factor scores, the I × T matrix P is horizontally partitioned into K matrices of dimension I_k × T, denoted P_k, with the same number of rows as X_k. The matrix P_k is calculated as:

P_k = X_k A Q Δ^{-1}

The variables of a new preprocessed table X_{K+1} can be projected onto the compromise space by obtaining the matrices

P_{K+1} = X_{K+1} A Q Δ^{-1} and F_{K+1} = X_{K+1}^T P_{K+1}

Then, the associated points in the compromise space are determined using the first and second columns of the J × T matrix F_{K+1} as coordinates. In the CO CCs, these points are compared against the partial factor scores of the K tables to detect out-of-control cases. The Dual STATIS steps are summarized in the scheme shown in Figure 1.
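A self-contained sketch of the compromise and partial factor scores, following our reconstruction of the formulas after Abdi et al. (2012) (Python/NumPy; function name ours). The key identity it illustrates is that F = Q Δ equals the weighted average Σ_k m_k F_k:

```python
import numpy as np

def compromise_and_partial_scores(tables, m):
    """Compromise scores F = Q Delta and partial scores
    F_k = X_k^T P_k with P_k = X_k A Q Delta^(-1); the compromise is
    the barycenter F = sum_k m_k F_k (sketch after Abdi et al., 2012)."""
    J = tables[0].shape[1]
    A = np.full(J, 1.0 / J)                       # equal variable masses
    M = np.concatenate([np.full(len(X), mk) for X, mk in zip(tables, m)])
    X = np.vstack(tables)
    # GSVD of (X, M, A) via SVD of the reweighted matrix
    Xw = np.sqrt(M)[:, None] * X * np.sqrt(A)[None, :]
    _, delta, Qwt = np.linalg.svd(Xw, full_matrices=False)
    Q = Qwt.T / np.sqrt(A)[:, None]
    F = Q * delta                                 # compromise factor scores
    Fk = [Xk.T @ ((Xk * A[None, :]) @ Q / delta[None, :]) for Xk in tables]
    return F, Fk
```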

Parallel Coordinates
Parallel Coordinates is a graphical representation system proposed by Alfred Inselberg in which n dimensions can be represented in a two-dimensional system (Inselberg & Dimsdale, 1990). Like Dual STATIS, parallel coordinate plotting is used to display multidimensional data in two dimensions in order to identify homogeneous patterns among many different variables (Heinrich & Weiskopf, 2013). When lines present similar behavior, this tool is very useful for identifying clusters of observations (Inselberg, 2009). In this sense, Parallel Coordinates can be used for 'visual clustering', i.e., to find groups of similar points based on visual features such as the proximity of lines or line density (Heinrich & Weiskopf, 2013). For control and monitoring, the capacity of Parallel Coordinates for clustering and summarization is of great importance when plotting confidence regions (Dunia et al., 2012). Each observation within the K tables is represented across the J dimensions (variables) by a polygonal line with J − 1 segments. In addition, the correlation analysis between the J variables focuses on the intersections of the polygonal lines constructed from the raw K tables (Inselberg, 2009; Wegman, 1990).
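The link between segment crossings and correlation can be made concrete with a small heuristic (our own illustration, not part of the original method): between two adjacent axes, two observations' segments cross exactly when their orderings on the two axes disagree, so the fraction of crossing pairs is near 1 for strong negative association and near 0 for strong positive association.

```python
import numpy as np

def crossing_fraction(x, y):
    """Fraction of observation pairs whose segments cross between two
    adjacent parallel axes (each axis scaled to [0, 1], as in a
    parallel coordinate plot)."""
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    n = len(x)
    cross = 0
    for i in range(n):
        for j in range(i + 1, n):
            # segments cross iff the orderings on the two axes disagree
            if (x[i] - x[j]) * (y[i] - y[j]) < 0:
                cross += 1
    return cross / (n * (n - 1) / 2)
```

This quantity is the fraction of discordant pairs, i.e., it is directly related to Kendall's rank correlation.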

DS-PC strategy
The DS-PC strategy stands for the combination of Dual STATIS and Parallel Coordinates. The proposal (Figure 2) is based on nonparametric CCs provided by DS and on two-dimensional plots derived from PC.
During the pilot phase, the data set must be preprocessed to structure the K reference batches, as described previously in the zero step. This preprocessing is vital for the later detection of out-of-control signals. Afterwards, Dual STATIS analyses are computed to obtain the interstructure matrix G, the compromise matrix F, and the partial factor scores matrix F_k per batch. This information is crucial for setting up the IS and CO_j CCs.
The charting procedure begins with the construction of control regions, which are based on bagplots (Rousseeuw, Ruts, & Tukey, 1999). The bagplot is a bivariate generalization of the univariate boxplot in which the 'depth median' plays the role of the robust centroid; it is surrounded by a convex polygon that defines the control region for the referred charts (Figure 3).
Input information for the control regions is provided by the matrices G and F_k, calculated from in-control batches. The first two columns of matrix G give the batches' positions in the interstructure space (IS), and the first two columns of matrix F_k exhibit the coordinates of each batch per variable (defined by its corresponding row j) in the compromise space (CO_j).

Figure 1. Dual STATIS representation including interstructure and intrastructure analysis: the first step, also known as interstructure analysis, is represented by gray arrows, while the second step, commonly referred to as intrastructure analysis, is represented by light blue arrows. The zero step is not shown.
The initial step for building the control regions consists of plotting the input information as data points. The depth median (symbolized with *) is then calculated from the sketched data points, and a bag containing 50% of these data points is plotted. Magnifying the bag by a factor of three yields the fence, which separates in-control and out-of-control states (outliers) (Gower, Gardner-Lubbe, & le Roux, 2011). This factor ensures that 99.8% of the data is included in the control region, giving a desired false alarm probability α = 0.002 under the normality assumption (Zani, Riani, & Corbellini, 1998). The fence determines the 1 − α control region, drawn directly on the principal plane displays obtained using Dual STATIS. Finally, the fence itself is not considered part of the bagplot; hence, the data points not marked as outliers are surrounded by a loop, the convex polygon of the observations within the fence (Narayanachar, 2013; Rousseeuw et al., 1999). This algorithm is illustrated in Figure 3.
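The fence construction can be sketched as follows (our own simplified Python/NumPy illustration, not the exact Rousseeuw et al. algorithm: Tukey depth is approximated by scanning a finite set of directions, and the bag is taken as the 50% deepest points):

```python
import numpy as np

def _cross(o, a, b):
    # z-component of the 2D cross product (o->a) x (o->b)
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _convex_hull(pts):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(map(tuple, pts)))
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return np.array(lower[:-1] + upper[:-1])

def _inside(hull, q):
    # point in CCW convex polygon: all edge cross products nonnegative
    n = len(hull)
    return all(_cross(hull[i], hull[(i + 1) % n], q) >= -1e-9 for i in range(n))

def bagplot_in_control(ref, q, factor=3.0, n_dirs=180):
    """Bagplot-style control region: approximate Tukey depth over
    n_dirs scanning directions, bag = ~50% deepest reference points,
    fence = bag inflated by `factor` about the depth median.
    Returns True when q falls inside the fence (in control)."""
    ref = np.asarray(ref, float)
    ang = np.linspace(0.0, np.pi, n_dirs, endpoint=False)
    dirs = np.c_[np.cos(ang), np.sin(ang)]
    def depth(p):
        proj = (ref - p) @ dirs.T
        return int(np.minimum((proj >= 0).sum(0), (proj <= 0).sum(0)).min())
    depths = np.array([depth(p) for p in ref])
    median = ref[depths.argmax()]                   # depth median
    bag = ref[depths >= np.median(depths)]          # ~50% deepest points
    fence = median + factor * (_convex_hull(bag) - median)
    return _inside(fence, np.asarray(q, float))
```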
During the monitoring phase, these control regions are used to test the bivariate statistics calculated for newly structured tables from new batches, leading to the IS and CO_j CCs. Each new table is projected onto the interstructure and compromise spaces through the matrices G_{K+1} and F_{K+1}.
Off-line control of a new batch K + 1 starts with the projection of the corresponding data matrix X_{K+1} onto the IS CC using the interstructure projection described above. It is important to remark that the information of the new batch is entirely available and has already been preprocessed. In addition, the new batch is projected onto the CO_j CCs using the compromise projection. At the same time, parallel coordinate plotting is done with the information of several reference batches and the new one, to distinguish any abnormal performance along the variables' axes.

Case studies
To evaluate the performance of the proposed strategy, two case studies are developed in this section. The first is an illustrative example that simulates a batch process. The second is the fed-batch penicillin fermentation process, a well-known benchmark batch process. All simulations and analyses were computed in the open-source software R.

Illustrative example
In the plastic bag fabrication industry, data from 200 reference batches (K = 200) were simulated. For each batch, three process variables (J = 3) and 50 individuals (I_k = 50) were recorded. The means and standard deviations were chosen from expected standards, and the correlation matrix was proposed by the authors to represent the normal behavior of the process. These values are shown in Table 1.
The correlation structure of the data was introduced through the Cholesky decomposition of the variance-covariance matrix (Golub & Van Loan, 1996; Horn & Johnson, 2013; Trefethen & Bau, 1997). In this sense, the reference data set was formed by random batches following a multivariate normal distribution.
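The Cholesky-based simulation can be sketched as follows (Python/NumPy illustration; the paper's simulations were done in R, and the function name and parameter values are ours):

```python
import numpy as np

def simulate_batch(mu, sd, R, n, rng):
    """Draw n multivariate-normal rows with mean vector mu, standard
    deviations sd, and correlation matrix R, via Cholesky:
    Sigma = L L^T and X = mu + Z L^T with Z ~ N(0, I)."""
    Sigma = np.outer(sd, sd) * R              # covariance from correlations
    L = np.linalg.cholesky(Sigma)             # lower-triangular factor
    return mu + rng.standard_normal((n, len(mu))) @ L.T
```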
Additionally, a set of eight batches was simulated to exemplify the performance of the DS-PC strategy described in Section 2.3. This set consisted of one good batch (in control) and seven bad batches (out of control). Variables were perturbed to simulate the bad batches according to the mean, standard deviation, and correlation shifts shown in Table 2. Some batches introduced individual shifts and others exhibited a combination of different shifts.
After analyzing the 200 reference data sets and 8 training batches using Dual STATIS method, graphical representations of interstructure and compromise behavior were obtained.
The IS CC shows the training batches' projection in Figure 7. As expected, batch B_1 falls within the control region, while the remaining batches B_2 to B_8 are signaled as out of control. The next step in our strategy is to plot the CO_j CCs to investigate which variables are responsible for the signals. To facilitate the analysis, an equally scaled graph has been plotted for each variable, as shown in Figure 7. The CO_j CCs signal the three variables in an out-of-control state.

Figure 7. IS and CO_j control charts for the illustrative example: points in black correspond to batches without shifts; batches exhibiting different levels of shifts are plotted in blue. The IS and CO_j CCs were plotted with 7 batches signaled as out of control.

Figure 8. Parallel coordinate plots for the illustrative example: observations within the reference behavior are plotted in gray, and the ones in blue represent new observations coming from ongoing batches during the process. Batch B_1 represents no shifts; the remaining batches portray mean, standard deviation, and correlation shifts, and combinations of shifts, in three variables.
The bad batches show anomalous behavior in the three variables analyzed. However, this visual aid does not allow examination of the types of shifts the batches suffered. A complementary analysis carried out with Parallel Coordinates is shown in Figure 8. This technique allows comparison of observation tendencies between several reference batches and the out-of-control batches.
Batch B_1 presents a normal pattern within the boundaries defined by the reference batches. However, B_2 exhibits values above the mean, which evidences an increase in the three variables tested. B_3 clearly shows standard deviation shifts, with values falling outside the reference pattern. Although an almost linear positive correlation between variables is present in B_1, the intersection points between polygonal lines demonstrate a negative linear correlation in batch B_4, reflecting greater changes in the correlation structure. Batch B_5 exhibits changes in the three variables, with values below the mean and smaller deviation than usual. B_6 also presents values with less deviation than the normal pattern and an unnoticeable decrease in the correlation coefficients, whose visualization is limited by the combination of shifts. Batch B_7 has values below the mean and evidences moderate changes in correlation due to a negative factor, yet this shift is not easily detected. Finally, B_8 presents three types of shifts: values above the mean, greater standard deviations, and a change of sign in the correlation coefficients, the last being more noticeable than in batch B_7.

Simulations
As suggested by Filho and Luna (2015), for an overall examination of the efficiency of the IS and CO_j CCs, 11 simulated scenarios with 10,000 runs each were computed. Results are shown in Table 3.
The scenario with no changes presents a low percentage of false alarms in both charts. Mean shifts of ±4% and ±6% evidence a high percentage of signal detection in all perturbed variables. Standard deviation shifts of ±20% and ±40% in the three variables reveal greater detection for greater shifts. When the correlations are affected in a similar [...] In general, combined shifts increase the percentage of detection of out-of-control cases when mean shifts are involved, owing to the high sensitivity for detecting this type of change. However, combined standard deviation and correlation shifts exhibit better detection than correlation shifts alone when the coefficients' signs are maintained. Finally, scenarios with the three types of shift produce better detection than those where two types of shift were introduced. Although the method performs efficiently in the scenarios presented above, some aspects of STATIS-based CCs may require further refinement of the method and complementary analysis.

Power curves comparison
One hundred sequential factors were tested in performance simulations (1,000 runs each) to evaluate different levels of proportional shifts in the mean, standard deviation, and correlation coefficients in the IS CC. Figures 9-11 show the percentage of out-of-control cases vs. the factor defining the mentioned shifts. The green line corresponds to the use of the bagplot with the depth median as robust centroid for setting up the control region, and the red line to the use of convex hull peeling and B-spline smoothing developed in Niang et al. (2009). Figure 9 shows that bagplots classify good and bad batches better than the convex hull, leading to a smaller number of false alarms within ±1.5% mean shifts and total rejection when shifts are higher than ±4%. Figure 10 evidences a slight asymmetry to the left, meaning the bagplot performs better than the convex hull at rapidly detecting larger dispersion. Although any industrial process seeks the smallest possible dispersion, detection of the opposite situation is significantly favored. Factors below 0.7 and above 1.5 lead to total rejection of the batch in both cases. In Figure 11, a proportional correlation shift is not easily detected with bagplots; as observed, the convex hull performs better. However, both lead to total rejection when the negative shift in the coefficients is greater than 50%, with a maximum shift of 100% (factor equal to −1).

Fed-batch penicillin fermentation process
In this section, the fed-batch penicillin fermentation process is used as a practical application of the DS-PC approach. This well-known benchmark simulation process is widely applied to evaluate batch process monitoring and fault diagnosis methods (Albazzaz, 2004; Birol, Ündey, & Çinar, 2002; Doan & Srinivasan, 2008; Hanyuan & Xuemin, 2017; Yoo, Lee, Vanrolleghem, & Lee, 2004). Its flow diagram is shown in Figure 12. The process involves two operation stages: the preculture stage and the fed-batch stage. During the initial stage, the necessary cell mass is generated and penicillin cells are produced. During the final stage, glucose is supplied into the system to maintain high penicillin production (Hanyuan & Xuemin, 2017).
In this study, the PenSim simulator is used to generate data for the penicillin fermentation process (Birol et al., 2002). Every generated batch contains the information of 400 h with a sampling rate of 0.5 h, leading to data sets of 800 observations. For algorithm evaluation, we monitored only the last 50 observations of 13 selected variables within the fed-batch stage. Reference data, including 100 batches, were generated under normal operating conditions. Furthermore, seven batches (two good and five faulty) were simulated. The faulty ones were generated by altering the set points of five variables. Simulation information is detailed in Table 4. Batches projected onto the IS CC show five faulty batches outside the control region (Figure 13). In the CO_j CCs, faulty batches are detected as out of control primarily in the CCs corresponding to the perturbed variables. However, several batches whose variables did not suffer any modification are also signaled as faulty because of the correlation structure among the monitored variables (Figure 13).
The described tendency can also be corroborated with parallel coordinate plotting, as shown in Figure 14, where just one good batch is plotted.

Discussion
Our approach extends the work of Scepi (2002), where the use of the STATIS method in multivariate quality control was initially proposed. However, at least four contributions separate our proposal from the one presented in her work and others of similar importance. The first is related to the preprocessing techniques applied to the raw data; the second is the DS-PC strategy to define the nonparametric control region; the third is the advantage of detecting different types of changes with the same CCs; and the fourth is the use of a well-known visual clustering method in multivariate process control as a diagnostic tool for the out-of-control points signaled in the proposed CCs.

Figure 13. IS and CO_j control charts for the fed-batch penicillin fermentation process: points in black correspond to batches without shifts; batches exhibiting different levels of shifts are plotted in blue. The IS CC showed five faulty batches, while the CO_j CCs evidenced several batches out of control according to the perturbed variables and their correlation with other variables of the process.
First, similar applications of STATIS extensions already published consider the mean and standard deviation of each batch for centering and scaling (Abdi et al., 2012; Filho & Luna, 2015), or simply do not include any criteria for this purpose, focusing on the normalization procedure. Instead, for our purpose, the data preprocessing is performed with the global mean and standard deviation of the reference batches, complemented with a normalization criterion. The normalization of all tables was performed by dividing each table by its number of rows (which is the same for every table in this example of Dual STATIS). This step is crucial to improve the efficient detection of out-of-control signals (Abdi et al., 2012).
Second, the use of the bagplot in quality control, as a variant of the well-known convex hull peeling and B-spline smoothing, is introduced in our proposal. Zani et al. (1998) applied convex hull peeling, which is somewhat less robust than half-space depth, as shown by Donoho and Gasko (1992). The notion of the bagplot is based on Tukey's half-space depth, which has no restriction on the number of dimensions (Kruppa & Jung, 2017; Rousseeuw et al., 1999). The factor with which the bagplot is drawn poses another advantage of our proposal, because the occurrence probability of false alarms is theoretically reduced to 0.002 (Rousseeuw et al., 1999), leading to a reduction of losses due to shrinkage.
Third, multivariate CCs typically detect out-of-control points caused by several kinds of changes in the data, requiring complementary univariate analysis to reveal the type of shift responsible for the signal. In this sense, the use of several CCs is mandatory to address shifts in the mean, variability, and correlation structure (Bersimis et al., 2007; Montgomery, 2009). However, the work proposed here allows the detection of mean, standard deviation, and correlation shifts with a single chart (the IS CC for batches and the CO_j CCs for variables), where larger shifts produce greater detection.
Fourth, an original combination of an alternative analysis from the STATIS toolbox with Parallel Coordinates is presented as a visual tool for detecting out-of-control batches and variables and for identifying atypical behavior. The summarization of large multivariate data sets by both methods is enhanced by the control and monitoring application given here. Moreover, STATIS-based CCs plus parallel coordinate plots offer a representation that considers the cross-correlations of the process variables besides the means and standard deviations. Each shift can be interpreted as the result of different special or assignable causes depending on the process. Besides, such tools signal out-of-control states precisely, leading to a smaller number of false alarms. To the best of our knowledge, similar studies have not considered this useful combination; the literature reports these techniques applied individually (Comfort et al., 2011; Dunia et al., 2012; Filho & Luna, 2015; Niang et al., 2009, 2013).
Recent research published in multivariate process control refers to time-wise analysis considering time instants within the batches (Filho & Luna, 2015; Niang et al., 2009, 2013). These authors have already stated variable-wise analysis as future research; however, none of them has implemented the combined approach presented here. Another advantage of the proposed method is that the control and monitoring scheme for batch processes is nonparametric, unlike other procedures, which widens its applicability to any case without establishing the data distribution. Usually, real data do not follow an easy-to-use distribution, making it difficult to control and monitor the process (Bersimis et al., 2007; Lewis, 2014).
Natural extensions of the work presented here include (i) a comparative study of the IS and CO_j CCs obtained using the proposed method and others available in the literature in terms of sensitivity and statistical power; (ii) refinement of the method when shifts are introduced in several variables but not all of them; and (iii) development of the proposed charts to handle batches with missing values, which are common in practice. Research on real-time (on-line) variable-wise monitoring of batches, which was not addressed in our methodology, is currently in progress.

Conclusion
This paper presents a new approach to controlling multivariate processes in batch production, combining Dual STATIS and Parallel Coordinates without special constraints on the data distribution. In addition, the construction of the IS and CO_j CCs using robust bagplots for setting up the control regions allows changes in location, spread, and relationships between variables to be visualized, with greater fault detection as larger shifts take place. The DS-PC strategy thus becomes a powerful tool for improving corrective actions and making important decisions in risk management.
On the other hand, the visualization of reference and new batches in the same plot with parallel axes provides an efficient analysis resource when unusual behavior is detected, to identify mean, standard deviation, and correlation shifts and to search for extreme values that can occur in the process variables during production.
Although the control of a batch production process requires an elaborate statistical framework during the pilot and monitoring phases, there is no doubt that the original combination of visual resources proposed here facilitates the inspection and identification of possibly faulty batches and out-of-control variables as a step prior to batch release.

Disclosure statement
No potential conflict of interest was reported by the authors.