Cause detection of unqualified bendability of hot rolled strip via improved RankBoost with multiple feature ranking algorithms

ABSTRACT In the hot rolling process, the mechanical properties of steel materials are important to steel quality, and bendability is one of the key parameters for evaluating the formability of the strip. When the bendability is unqualified, detecting the causes becomes a big challenge. In this paper, a model to find the causes of unqualified bendability of hot rolled strip is built, based on improved RankBoost with multiple feature selection algorithms using historical data. Firstly, the related process variables and bendability results are collected. Then, seven feature ranking methods, including Fisher score, Relief, Gini index, T-test, Kruskal-Wallis, mutual information entropy and minimum redundancy maximum relevance (MRMR), are used to rank the significance of the features individually. Finally, to summarize the results of the seven methods, the total importance of every feature is obtained using the improved RankBoost method, and the most important features are selected as the major causes. A real field data set from the hot rolling strip steel process is used to validate the model. The results demonstrate that the RankBoost method gives a more credible result.

Hot rolled steel strip is manufactured through various production processes. Iron ore and coke are fed into the blast furnace to make iron. The blast furnace is a huge chemical reactor in which reduction reactions take place. The iron is then sent to the steelmaking process, where blooms are produced. The steelmaking process consists of converters for removing carbon, refiners for adjusting elements, and continuous casters. The blooming process then resizes blooms into slabs for the subsequent rolling process. The purpose of hot rolling is to turn reheated steel slabs into strips.
Bending is one of the deformation modes in press forming (Lester, 1973). Therefore, one type of steel strip, mainly used for automotive parts, is required to have good bendability. In manufacturing, some unqualified products inevitably occur. Physical models and finite element methods are used to analyse the relationship between the microstructure and bendability (Eiji et al., 2014), and the analysis results can be used to design and control the bendability of the strip. However, when unqualified products appear, it is hard to find the causes.
The modern hot rolling process is highly automated and monitored by many sensors. The large amount of sensing data provides great opportunities for effective quality control of the hot rolling process. Mechanical properties are influenced by the chemical content and all kinds of process parameters in manufacturing. When the product quality cannot satisfy the needs of customers, fault diagnosis methods are used to detect the major causes. Multivariate statistical approaches for process monitoring and fault diagnosis have developed rapidly in recent decades, mainly due to the adoption of powerful latent projection techniques such as principal component analysis (PCA) and partial least squares (PLS) (Sharma et al., 2013). To cope with the nonlinearity problem, the principal curve based on neural networks, kernel PCA (KPCA) and kernel PLS (KPLS) (Liu et al., 2011; Samuel & Cao, 2016) have been proposed. When process data from the normal process are available, PCA and KPCA can be used for process monitoring and cause detection; when process data and continuous quality data from the normal process are collected, PLS and KPLS can be used. The bendability value, however, is tested offline with a delay of several hours. Sometimes a whole batch of unqualified products appears, and determining the cause quickly becomes a big challenge. Given process data and discrete quality labels, we propose that feature selection methods be used to judge the significance of the features for classification. The features that separate the different classes clearly are selected preferentially. If the importance of the features can be ranked, the first several features can be considered as the causes of the unqualified products. There are many feature selection methods that can be used, but we do not know which one gives the correct significance analysis.
In this paper, Fisher score, Relief, Gini index, T-test, Kruskal-Wallis, mutual information entropy and minimum redundancy maximum relevance (MRMR) are used to rank the significance of the features individually. The contribution of this paper lies in two aspects. On one hand, in order to select the most important features as the major causes, an improved RankBoost method is given that combines the seven feature selection methods, so that the total significance of every feature can be obtained in a way that considers the result of every method. On the other hand, the improved method is applied to tackle the cause detection of unqualified bendability.

Introduction of hot strip rolling process
The purpose of hot rolling is to turn reheated steel slabs into strips. A hot strip line is always composed of reheat furnaces, a roughing mill, several finishing mills, and two coilers. In the roughing mill, the reheated slabs are reduced to a thickness of 25-50 mm and narrowed to the desired width. The resulting sheet slab is then transported to the finishing mill, where it is further reduced to the final thickness of 1-20 mm. The resulting strip is then coiled to form the finished coil of steel strip. In the finishing rolling, to achieve the required reduction, final qualities and tolerances, several passes of rolling are executed by tandem rolling with six or seven successive stands in the presence of inter-stand tension. A simplified schematic diagram of a steel rolling mill for the production of coil plate is presented in Figure 1. It shows the route of slabs from entry at the reheat furnaces to their exit at the coiler. The process route can best be described in terms of the major items of equipment (Bissessur et al., 2000;Wang & He, 2019).
(1) Reheat furnace: The feed stock for the rolling mill is slabs produced by the continuous casting process in a steel plant. These are normally supplied at ambient temperature. The purpose of the furnace is to raise the temperature of the whole slab to around 1300°C. (2) Roughing mill: This is a reversing mill that produces a breakdown slab (the product between the roughing mill and the finishing mill) by rolling the slab through a series of forward and reverse passes, typically reducing the slab thickness from 200 to 30 mm. (3) Finishing mill: This is designed to reduce the gauge (thickness) of the breakdown slab to that of the finished strip, while maintaining the desired tolerances. A sequential combination of stands is used, e.g. six to seven. The mill control system is critical, as constant mass flow must be maintained in all stands to ensure continuous production. (4) Down coiler: On exit from the mill, the hot strip typically has a velocity of up to 10 m/s and can be hundreds of metres in length. The down coiler allows the strip to be converted into a coil.

Bendability of hot strip
In hot strip rolling, the mechanical properties of steel materials are important to steel quality; they are measured offline and destructively. The main parameters of the mechanical properties include the elongation rate, yield point and tensile strength, which are continuous values. Bendability is one of the key parameters for evaluating the formability of the strip. The bending test is used to evaluate how easily the strip can be formed by bending with approximately plane strain deformation; crack generation is checked after carrying out bending of a specified radius, as shown in Figure 2 (Mertin et al., 2019). The bendability can then be described as qualified or unqualified. In order to detect causes using historical data, the process variables that may impact the bendability quality are selected according to expert knowledge, as described in Table 1.

Fisher score
In the Fisher score method, given training vectors, if the numbers of positive and negative instances are $n_+$ and $n_-$ respectively, then the Fisher score of the $i$th feature is defined as follows (Chen & Lin, 2006; Kemal & Volkan, 2011):

$$F(i) = \frac{(\bar{x}_i^{(+)} - \bar{x}_i)^2 + (\bar{x}_i^{(-)} - \bar{x}_i)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2} \quad (1)$$

where $\bar{x}_i$, $\bar{x}_i^{(+)}$ and $\bar{x}_i^{(-)}$ are the averages of the $i$th feature over the whole, positive and negative data sets, respectively; $x_{k,i}^{(+)}$ is the $i$th feature of the $k$th positive instance, and $x_{k,i}^{(-)}$ is the $i$th feature of the $k$th negative instance. The numerator represents the discrimination between the positive and negative sets, and the denominator represents the scatter within each of the two sets. The larger the Fisher score $F(i)$ is, the more discriminative this feature is likely to be (Duda et al., 2001; Gunes et al., 2010).
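As a concrete illustration (not part of the original paper), Equation (1) can be sketched in plain Python for a single feature; the function name and the list-based data layout are assumptions:

```python
def fisher_score(pos, neg):
    """Fisher score of one feature, following Equation (1).

    pos / neg: lists of the feature's values over the positive and
    negative instances (n+ and n- samples, respectively)."""
    n_pos, n_neg = len(pos), len(neg)
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    # numerator: discrimination between the positive and negative sets
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # denominator: scatter within each of the two sets
    within = (sum((v - mean_pos) ** 2 for v in pos) / (n_pos - 1)
              + sum((v - mean_neg) ** 2 for v in neg) / (n_neg - 1))
    return between / within
```

Features would then be ranked by computing this score column by column and sorting in descending order.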

Relief
The key idea of Relief is to estimate the quality of attributes according to how well their values distinguish between instances that are near to each other. For that purpose, feature weights are iteratively estimated according to their ability to discriminate between neighbouring patterns (Aldehim & Wang, 2015). In each iteration, a data point $x$ is randomly selected and then the two nearest neighbours of $x$ are found, one from the same class (termed the nearest hit or NH) and the other from a different class (termed the nearest miss or NM). The weight of the $i$th feature is then updated (Kira & Rendell, 1992):

$$W(i) = W(i) - \frac{\mathrm{diff}(i, x, NH(x))}{m} + \frac{\mathrm{diff}(i, x, NM(x))}{m} \quad (2)$$

where $i$ denotes the $i$th feature, $x$ is the randomly selected data point, $m$ is the sample size, and $\mathrm{diff}()$ is the distance between samples on the given feature. $NH(x)$ and $NM(x)$ are the nearest neighbour sample points with the same class and a different class, respectively. Every weight is then calculated through $T$ iterations. The details of the Relief algorithm are given in (Aldehim & Wang, 2015; Kira & Rendell, 1992; Kononenko, 1994).
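A minimal pure-Python sketch of the Relief update in Equation (2) follows; the L1 distance, the absolute per-feature diff, and the assumption that features are pre-scaled are simplifications of this sketch, not statements from the paper:

```python
import random

def relief_weights(X, y, n_iter=100, seed=0):
    """Relief sketch: X is a list of feature vectors, y a list of class
    labels. Assumes features are pre-scaled; diff() is taken as the
    absolute difference on one feature (an assumption of this sketch)."""
    rng = random.Random(seed)
    m, d = len(X), len(X[0])
    w = [0.0] * d

    def dist(a, b):  # L1 distance between two samples
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    for _ in range(n_iter):
        k = rng.randrange(m)
        x, label = X[k], y[k]
        hits = [j for j in range(m) if j != k and y[j] == label]
        misses = [j for j in range(m) if y[j] != label]
        nh = X[min(hits, key=lambda j: dist(X[j], x))]    # nearest hit
        nm = X[min(misses, key=lambda j: dist(X[j], x))]  # nearest miss
        for i in range(d):
            # reward features that differ on the miss, punish on the hit
            w[i] += (abs(x[i] - nm[i]) - abs(x[i] - nh[i])) / n_iter
    return w
```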

Gini index
Suppose that $x_i$ is the $i$th feature of the data set and its class label attribute has two different values, defining the classes $C_j$ ($j = 1, 2$). According to the class label attribute value, $x_i$ can be divided into two subsets. If $x_i^{(j)}$ is the subset of samples belonging to class $C_j$, and $m_i^{(j)}$ is the number of samples in the subset $x_i^{(j)}$, then the Gini index of the set $x_i$ is (Shang et al., 2006; Zhu & Lin, 2013)

$$\mathrm{Gini}(x_i) = 1 - \sum_{j=1}^{2} P_j^2 \quad (3)$$

where $P_j$ is the probability that any sample belongs to $C_j$, estimated by $m_i^{(j)}/m_i$. When $\mathrm{Gini}(x_i)$ reaches its minimum of 0, all records in the collection belong to the same category, indicating that the maximum useful information is obtained. When all the samples in the collection are uniformly distributed over the categories, $\mathrm{Gini}(x_i)$ reaches its maximum, indicating that the minimum useful information is obtained. The Gini index thus measures the 'impurity' of an attribute for categorization: the smaller its value, the lesser the 'impurity' and the better the attribute. The Gini index of every feature is then computed and sorted in ascending order.
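The impurity of Equation (3) can be sketched as follows; note the split rule on a continuous feature value is an illustrative assumption of this sketch (the paper partitions by the class label attribute):

```python
from collections import Counter

def gini_of_split(values, labels, threshold):
    """Weighted Gini index of a feature after a binary split at
    `threshold` (the split rule is an assumption of this sketch; the
    impurity of each subset follows Equation (3))."""
    def gini(lbls):
        n = len(lbls)
        if n == 0:
            return 0.0
        # 1 - sum of squared class probabilities, Equation (3)
        return 1.0 - sum((c / n) ** 2 for c in Counter(lbls).values())
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A pure split yields 0; a mixed split yields a positive impurity, so features are sorted in ascending order of this value.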

T-test
When we want to compare the difference between two sets, the T-test is used to test whether the mean values are different. Define the data set $X$ which has two classes $C_1$ and $C_2$. Let $m_1$, $v_1$ denote the mean and variance of a feature in class $C_1$, and $m_2$, $v_2$ the mean and variance in class $C_2$. Then the T-test statistic is

$$t_i = \frac{m_1 - m_2}{\sqrt{v_1/n_1 + v_2/n_2}} \quad (4)$$

in which $n_1$, $n_2$ are the sample sizes of $C_1$ and $C_2$. Then $t$ is a vector in which $t_i$ represents the T-test result of the $i$th feature. The $i$th feature is more significant when $|t_i|$ is larger. Thus, the T-test statistic is computed and $t_i$ is sorted in descending order.
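Equation (4) is a one-liner in practice; the sketch below (function name assumed) uses the sample variance:

```python
import math

def t_statistic(a, b):
    """Per-feature t statistic of Equation (4): t = (m1 - m2) /
    sqrt(v1/n1 + v2/n2), with sample variances."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((v - m1) ** 2 for v in a) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in b) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
```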

Kruskal-Wallis
The Kruskal-Wallis statistical test is a non-parametric test that makes no assumptions about the distribution of the data (e.g. normality) (Hollander & Wolfe, 1973). Many nonparametric test methods use data rank rather than raw values to calculate the statistic.
Let $n_1, n_2, \ldots, n_K$ represent the sample sizes of the $K$ classes, with total sample size $N = \sum_{k=1}^{K} n_k$. The combined sample is ranked and the sum of the ranks $R_k$ for class $k$ is computed. The Kruskal-Wallis test statistic is then calculated as

$$H = \frac{12}{N(N+1)} \sum_{k=1}^{K} \frac{R_k^2}{n_k} - 3(N+1) \quad (5)$$

If the null hypothesis of equal medians holds, this test statistic follows approximately a chi-square distribution with $K - 1$ degrees of freedom. The larger the test statistic $H$, the weaker the support for the null hypothesis, since a strong separation of the medians indicates that the feature under consideration has high classification power (Cor et al., 2006).
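Equation (5) can be sketched directly from the rank sums; this illustrative version (function name assumed) omits the tie correction, which is an assumption:

```python
def kruskal_wallis(groups):
    """Kruskal-Wallis H of Equation (5); assumes no tied values, so no
    tie correction is applied (an assumption of this sketch)."""
    combined = sorted((v, k) for k, grp in enumerate(groups) for v in grp)
    N = len(combined)
    rank_sums = [0.0] * len(groups)
    for rank, (_, k) in enumerate(combined, start=1):
        rank_sums[k] += rank  # R_k: sum of ranks of class k
    term = sum(R ** 2 / len(g) for R, g in zip(rank_sums, groups))
    return 12.0 / (N * (N + 1)) * term - 3.0 * (N + 1)
```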

Mutual information entropy
Information theory was conceptualized by Shannon to deal with the problem of optimally transmitting messages over noisy channels. In information theory, entropy is regarded as a measure of information, and Hartley called it the 'amount of information' (Principe, 2010). Since it is capable of quantifying the uncertainty of random variables and scaling the amount of information shared by them effectively, it has been widely used in many fields (Principe, 2010). Let $X$ denote a random variable taking values in a finite set $\mathcal{X} = \{x_1, x_2, \ldots, x_k, \ldots, x_N\}$ according to a probability distribution $p(x_k)$; its uncertainty can be measured by the entropy $H(X)$, defined as

$$H(X) = -\sum_{k=1}^{N} p(x_k) \log p(x_k) \quad (6)$$

Note that the entropy does not depend on the actual values, but only on the probability distribution of the random variable. The total decrease of uncertainty in $X$ by observing $Y$ is known as the mutual information between $X$ and $Y$, which is defined as (Sylvain et al., 2008)

$$I(X, Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \quad (7)$$

The mutual information $I(X, Y)$ quantifies how much information is shared by the two variables $X$ and $Y$.
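For discrete samples, Equation (7) can be estimated with plug-in (empirical) probabilities, as in this illustrative sketch (function name and natural-log convention are assumptions):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) of Equation (7) for paired discrete samples, using the
    empirical (plug-in) probabilities and the natural logarithm."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), with p = count / n
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi
```

Perfectly dependent variables give I = H(X); independent ones give 0.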

MRMR
Minimum redundancy maximum relevance is based on the mutual information method. We propose to expand the space covered by the feature set by requiring that features be maximally dissimilar to each other; for example, their mutual Euclidean distances are maximized, or their pairwise correlations are minimized. These minimum redundancy criteria are supplemented by the usual maximum relevance criteria, such as maximal mutual information with the targeted phenotypes. The benefits of this approach can be realized in two ways: (1) with the same number of features, the MRMR feature set is expected to be more representative of the targeted phenotypes, therefore leading to a better generalization property; (2) equivalently, a smaller MRMR feature set can effectively cover the same space that a larger conventional feature set does. Minimal redundancy makes the feature set a better representation of the entire data set. The minimum redundancy condition is

$$\min W, \quad W = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j) \quad (8)$$

where $I(x_i, x_j)$ represents the mutual information between $x_i$ and $x_j$, $S$ denotes the feature set, and $|S|$ is the number of features. The maximum relevance condition is to maximize the total relevance of all features in $S$:

$$\max V, \quad V = \frac{1}{|S|} \sum_{x_i \in S} I(x_i, y) \quad (9)$$

where $I(x_i, y)$ represents the mutual information between $x_i$ and the class label $y$. The MRMR feature set is obtained by optimizing the conditions in Equations (8) and (9) simultaneously. The simplest combination criterion is then

$$\max (V - W) \quad (10)$$
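A common way to optimize Equation (10) is greedy forward selection; the sketch below (function name and precomputed mutual-information inputs are assumptions of this sketch) adds one feature at a time, scoring each candidate by relevance minus its mean redundancy with the already-selected set:

```python
def mrmr_select(relevance, redundancy, k):
    """Greedy MRMR sketch using the additive criterion of Equation (10):
    relevance[i] = I(x_i, y); redundancy[i][j] = I(x_i, x_j), assumed
    precomputed. Picks k indices maximizing relevance minus the mean
    redundancy with respect to the already-selected set."""
    remaining = list(range(len(relevance)))
    selected = []
    while len(selected) < k and remaining:
        def score(i):
            red = (sum(redundancy[i][j] for j in selected) / len(selected)
                   if selected else 0.0)
            return relevance[i] - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

A highly relevant but redundant feature is skipped in favour of a less relevant, non-redundant one.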

Improved RankBoost with multiple feature ranking algorithms
Let X be a set called the feature space. There are many feature selection methods that can give rankings; the advantages and disadvantages of each feature selection method are shown in Table 2. Our goal is to combine a given set of feature rankings. A feature ranking is nothing more than an ordering of the features from most preferred to least preferred. In this paper the improved RankBoost method is used to combine the feature ranking results obtained from the different methods, as shown in Figure 3, instead of the mean processing method. We assume that n learning algorithms give n ranking features, denoted f_1, f_2, ..., f_n. Since each ranking feature f_i defines a linear ordering of the features, we can equivalently think of f_i as a scoring function in which higher scores are assigned to more preferred features; that is, we can represent any ranking feature as a real-valued function where f_i(x_1) > f_i(x_0) means that feature x_1 is preferred to x_0 by f_i.

Table 2 (recovered fragment). Advantages and disadvantages of the feature selection methods:
(...) Disadvantage: ignores the dependence and correlation between variables.
(5) Fisher score. Advantage: independent calculation of feature scores and feature selection. Disadvantage: ignores feature correlation, resulting in feature subsets that may be sub-optimal.
(6) Kruskal-Wallis. Advantage: the sample data does not have to be normally distributed. Disadvantage: not very convenient for calculating discrete variables.
(7) Relief. Advantage: strong versatility, low complexity, removes irrelevant features, suitable for large-scale data sets. Disadvantage: independent of the specific learning algorithm.
Note that every feature selection method may give a different ranking. Therefore, the RankBoost method is introduced to combine all of the ranking orders into a single ranking, called the final ranking, that can be represented by a function H: if H(x_1) > H(x_0), x_1 is preferred to x_0. RankBoost is an iterative algorithm based on AdaBoost for solving ranking problems. Like all boosting algorithms, RankBoost operates in rounds. The pseudo code is shown in Figure 4 (Freund et al., 2003). We assume access to a separate procedure called the weak learner that, on each round, is called to produce a weak ranking. RankBoost maintains a distribution D_t over X × X that is passed on round t to the weak learner. Intuitively, RankBoost chooses D_t to emphasize different parts of the training data: a high weight assigned to a pair of features indicates that it is important for the weak learner to order that pair correctly. The boosting algorithm uses the weak rankings to update the distribution; the weight of a pair is decreased if h_t ranks it correctly and increased otherwise. The final ranking H is a weighted sum of the weak rankings.
The RankBoost algorithm is used to combine all the feature selection methods as shown in Figure 5; the detailed description is as follows.

Firstly, we obtain the ranking features $f_1, f_2, \ldots, f_n$ from the feature selection methods, which are used as the weak learners on each round of RankBoost. We can equivalently think of $f_i$ as a scoring function with $f_i(x_0) = m + 1 - \mathrm{idx}(x_0)$, where $m$ is the number of features and $\mathrm{idx}(x_0)$ represents the ranking index of $x_0$ in the linear ordering $f_i$.

Secondly, we start to combine them by RankBoost, for which the initial distribution $D$ over $X \times X$ is needed. Assume a feedback function $\Phi: X \times X \to \mathbb{R}$. Here, $\Phi(x_0, x_1) > 0$ means that $x_1$ should be ranked above $x_0$, while $\Phi(x_0, x_1) < 0$ means the opposite; a value of zero indicates no preference between $x_0$ and $x_1$. To minimize the (weighted) number of misordered pairs of features, let $D(x_0, x_1) = c \cdot \max\{0, \Phi(x_0, x_1)\}$, where $c$ is a positive constant chosen so that

$$\sum_{x_0, x_1} D(x_0, x_1) = 1 \quad (11)$$

In this paper, we set $\Phi(x_0, x_1) = \mathrm{sign}(F(x_1) - F(x_0))$, where $F$ is the combination of the linear orderings $f_1, f_2, \ldots, f_n$ and $F(x_0)$ has the same form $m + 1 - \mathrm{idx}(x_0)$, with $\mathrm{idx}(x_0)$ now the ranking index of $x_0$ in the combined ordering; we use the average of the ranking indices of each feature in $f_1, f_2, \ldots, f_n$ to re-sort the features and obtain $F$.

Thirdly, the distribution $D_t$ is passed on round $t$ to the weak learner. On each round, we need to find the best one among all the weak learners according to $D_t$, i.e. the one that can minimize the ranking loss, defined as

$$\mathrm{rloss}_D(H) = \sum_{x_0, x_1} D(x_0, x_1)\,[\![H(x_1) \le H(x_0)]\!] \quad (12)$$

where $[\![\pi]\!]$ is 1 when $\pi$ is true and 0 when $\pi$ is false. Then we have $\mathrm{rloss}(H) \le \prod_t Z_t$, where

$$Z_t = \sum_{x_0, x_1} D_t(x_0, x_1)\exp\left(\alpha_t\,(h_t(x_0) - h_t(x_1))\right) \quad (13)$$

So we can minimize $Z_t$ in each round to reduce the ranking loss; at the same time, the parameter $\alpha_t$ is chosen. Suppose the weak ranking $h$ in the current round has range $\{0, 1\}$, and define

$$r = \sum_{x_0, x_1} D(x_0, x_1)\,(h(x_1) - h(x_0)) \quad (14)$$

Then $Z$ is minimized when

$$\alpha = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right) \quad (15)$$

which, plugged into Equation (13), yields $Z \le \sqrt{1 - r^2}$.
In particular, we will use weak rankings $h$ of the form

$$h(x) = \begin{cases} 1 & \text{if } f_i(x) > \theta \\ 0 & \text{if } f_i(x) \le \theta \end{cases} \quad (16)$$

where the threshold $\theta \in \{\theta_j\}_{j=1}^{J}$ is made up of the different values in $f_i(x)$ and $-\infty$, sorted in ascending order. Equation (14) then has another form,

$$r(f_i, \theta) = \sum_{x_0, x_1} D(x_0, x_1)\left([\![f_i(x_1) > \theta]\!] - [\![f_i(x_0) > \theta]\!]\right) \quad (17)$$

As the weak ranking $h$ has the 0-1 form, we can group the unknown parameters in $|r|$ as the pair $(f_i, \theta)$. Varying $f_i$ among all the scoring functions given by the feature selection methods and $\theta \in \{\theta_j\}_{j=1}^{J}$ as above, we obtain different values of $|r|$. When $|r|$ is largest, the corresponding $f_i$ is the best weak learner. Then, substituting that value of $r$ into Equation (15), we obtain $\alpha_t$.
Finally, the distribution D_t is updated to D_{t+1}, which is passed on to the next round. The iteration continues until the ranking loss settles at a stable low point, and the final ranking H is obtained.
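The steps above can be sketched end-to-end in a toy implementation; the function name, list-based data layout, mean-rank feedback F, and threshold weak rankers are assumptions of this sketch, not the paper's code:

```python
import math

def rankboost_combine(rankings, rounds=20):
    """Toy RankBoost sketch for combining feature rankings.
    `rankings` is a list of permutations of feature indices, each
    ordered from most to least important; assumes the mean ranks
    are not all tied (so at least one preference pair exists)."""
    m = len(rankings[0])
    feats = list(range(m))
    # scoring functions f_i(x) = m + 1 - idx(x)
    scores = [{x: m - r.index(x) for x in feats} for r in rankings]
    # feedback F from the average ranking index: prefer x1 over x0
    mean_score = {x: sum(s[x] for s in scores) / len(scores) for x in feats}
    pairs = [(x0, x1) for x0 in feats for x1 in feats
             if mean_score[x1] > mean_score[x0]]
    D = {p: 1.0 / len(pairs) for p in pairs}  # initial distribution
    H = {x: 0.0 for x in feats}
    for _ in range(rounds):
        # weak learner search: best threshold rule h(x) = [f_i(x) > theta]
        best_r, best_h = 0.0, None
        for s in scores:
            for theta in set(s.values()):
                h = {x: 1.0 if s[x] > theta else 0.0 for x in feats}
                r = sum(D[p] * (h[p[1]] - h[p[0]]) for p in pairs)
                if best_h is None or abs(r) > abs(best_r):
                    best_r, best_h = r, h
        r = max(min(best_r, 1.0 - 1e-9), -1.0 + 1e-9)  # clamp for safety
        alpha = 0.5 * math.log((1.0 + r) / (1.0 - r))  # Equation (15)
        for x in feats:
            H[x] += alpha * best_h[x]
        # decrease weight on correctly ordered pairs, increase otherwise
        Z = sum(D[p] * math.exp(alpha * (best_h[p[0]] - best_h[p[1]]))
                for p in pairs)
        for p in pairs:
            D[p] *= math.exp(alpha * (best_h[p[0]] - best_h[p[1]])) / Z
    return sorted(feats, key=lambda x: -H[x])
```

When a majority of input rankings agree, the combined ordering follows the consensus.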

Data set
There are 10 production process variables collected from the real hot strip rolling field, including the thickness reduction ratio, rolling process temperature information (rough milling exit temperature, finishing entry temperature, finishing exit temperature and coiling temperature), and the content of chemical components (C, Si, Mn, P, S). Because bendability is detected destructively, only one sample per steel coil can be chosen for the bending test to represent the quality of the entire coil. The mean values of the process parameters over each steel coil are computed to correspond to the bendability result. In all, 961 samples are collected, of which 890 come from qualified product processes and the other 71 from unqualified product processes.
In order to summarize the data set, statistics of the hot rolled strip data, including the maximum, minimum, average and standard deviation of every variable, are shown in Table 3. From Table 3, each variable has a different data range. Data normalization is therefore applied to the raw data first, to remove the impact of the different data ranges: each raw value has the mean subtracted and is then divided by the standard deviation.
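The normalization step can be sketched per variable as follows; the use of the sample standard deviation is an assumption of this sketch, since the paper does not state which estimator is used:

```python
def z_normalize(column):
    """Z-score normalization of one variable: subtract the mean and
    divide by the standard deviation (sample estimator assumed)."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / (n - 1)) ** 0.5
    return [(v - mean) / std for v in column]
```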

Feature selection process
Seven feature selection methods including Fisher score, Relief, Gini index, T-test, Kruskal-Wallis, mutual information entropy and Minimum redundancy maximum relevance, are used to rank the significance of features individually.
(1) Fisher score In the Fisher score method, the Fisher score F(i) of every feature is computed using Equation (1) and plotted in Figure 6. The larger the Fisher score F(i) is, the more discriminative the feature is. In Figure 6 the 3rd feature has the largest value, so the finishing entry temperature is the most important feature in the Fisher score method.
(2) Relief In the Relief method, the weight of every feature is calculated using Equation (2). The larger the weight is, the more important the feature is. The weights are shown in Figure 7, from which we can see that the thickness reduction ratio is the most important feature in the Relief method.
(3) Gini index When the Gini index reaches its maximum, the minimum useful information is obtained; a smaller value means less 'impurity' and a better attribute. The Gini index is computed using Equation (3) and the result is given in Figure 8. From Figure 8, the 3rd feature has the smallest Gini index, so the finishing entry temperature is the most important feature in the Gini index method.

(4) T-test
In the T-test method, the weight of every feature is calculated using Equation (4), which measures how significant the difference between the means of the two classes is. The larger the weight is, the more significant the feature is for separating the two classes. The weights are shown in Figure 9, from which we can see that the finishing entry temperature is the most important feature in the T-test method.

(5) Kruskal-Wallis
In the Kruskal-Wallis method, the weight of every feature is calculated using Equation (5). The larger the weight is, the more important the feature is. The weights are shown in Figure 10. The finishing entry temperature is the most important feature in the Kruskal-Wallis method, followed closely by the thickness reduction ratio.

(6) Mutual information entropy
The mutual information I(X, Y) quantifies how much information is shared by the two variables X and Y. In the feature selection process, X represents the feature and Y represents the quality information. The larger the mutual information I(X, Y) is, the more correlated the feature and the quality are. The mutual information between every feature and the quality is calculated using Equation (7) and shown in Figure 11. The finishing entry temperature is also the most important feature in the mutual information entropy method.

(7) MRMR
In the MRMR method, minimal redundancy makes the feature set a better representation of the entire data set, and the maximum relevance condition maximizes the total relevance of all features. To achieve minimum redundancy and maximum relevance, the feature ranking is computed using Equation (10). The finishing entry temperature is again the most important feature in the MRMR method.

Cause detection based on RankBoost
Finally, to summarize the results of the seven methods, the total importance of every feature can be obtained using the RankBoost method to select the most important features as the major causes.
Feature importance rankings based on every method are collected in Table 4. From Table 4, the 3rd feature is ranked as the most important feature in six of the seven methods; only in the Relief method is the 3rd feature not ranked first. Moreover, the first five features in every method are not exactly the same. If we use only one feature selection method, we may not get a credible cause. To summarize the results of the seven methods, mean processing and the RankBoost method are used, and the results are shown in Table 5. In the mean processing method, features are sorted by the mean ranking value of each feature over all the methods. To show the results clearly, the selected features are used for classification, and the fault detection rate and false alarm rate are then used to evaluate the ranking results: the better the ranking result, the higher the fault detection rate and the lower the false alarm rate. Support vector machines (SVM) are introduced to classify the normal and abnormal samples; the parameters of the SVM are optimized using cross validation. The fault detection rate and false alarm rate, as the number of selected features increases, are shown in Figures 12 and 13 respectively for every method.
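The two evaluation metrics can be computed as follows; this sketch assumes label 1 marks an unqualified (fault) sample and 0 a qualified one, which is a labelling convention of the sketch rather than of the paper:

```python
def detection_rates(y_true, y_pred):
    """Fault detection rate and false alarm rate, assuming label 1
    marks an unqualified (fault) sample and 0 a qualified one."""
    faults = [p for t, p in zip(y_true, y_pred) if t == 1]
    normals = [p for t, p in zip(y_true, y_pred) if t == 0]
    fdr = sum(faults) / len(faults)    # detected faults / actual faults
    far = sum(normals) / len(normals)  # false alarms / actual normals
    return fdr, far
```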
As shown in Figures 12 and 13, the fault detection rate and false alarm rate improve as the number of features increases in almost every method. With the mean processing and RankBoost methods, however, both rates improve faster and more steadily: the false alarm rate stays around its lowest value when 3 features are selected, and the fault detection rate reaches its maximum when 7 features are selected. The RankBoost algorithm can combine the different feature rankings into a final feature ranking, which is conducive to indicating the major causes. It is hard to select a suitable single method for the real data set; if only one method is used to rank feature importance, a wrong decision may be made, as with the Relief method in Table 4.

Table 4. Feature importance orders based on every method.
On the other hand, in most cases the fault detection rate of a single method is lower than that of the RankBoost method, especially with a smaller number of features, as shown in Figure 12. As a result, more mistakes in quality prediction are made when relying on one single method.
As shown in Table 4, the 3rd feature, corresponding to the finishing entry temperature, is the most important feature based on the RankBoost method, and the 1st feature, corresponding to the thickness reduction ratio, is the second most important. In the actual manufacturing process, the control accuracy of the finishing entry temperature should therefore be improved. For a clear comparison, the finishing entry temperature of the qualified and unqualified steel is shown in Figure 14, where the first 890 values come from coils with qualified bendability and the others from coils with unqualified bendability. When the finishing entry temperature is low, unqualified bendability is more likely; to improve the bendability, the finishing entry temperature may need to be increased. In the real hot rolling process there are many quality parameters, so this is a multi-objective optimization problem. Besides, the thickness reduction ratio should also be optimized in future work.

Discussion and conclusions
In this paper, a model to find the causes of unqualified bendability of hot rolled strip is built, based on the improved RankBoost method with multiple feature selection algorithms using historical data. Seven feature selection methods, including Fisher score, Relief, Gini index, T-test, Kruskal-Wallis, mutual information entropy and minimum redundancy maximum relevance, are used to rank the significance of the features individually. To summarize the results of the seven methods, the total importance of every feature is obtained using the RankBoost method, and the most important features are selected as the major causes. In all, 961 samples, including 890 qualified and 71 unqualified, are collected to validate the model. The result shows that the finishing entry temperature is the most important feature causing the unqualified bendability. In the actual manufacturing process, the control accuracy of the finishing entry temperature should be improved.
Cause detection based on feature selection methods can be applied when a large number of unqualified products appear. However, we can only infer the cause from the whole set of unqualified products, not the cause of a single product. The number of unqualified products is always smaller than that of normal products, so there is an unbalanced classification problem. In the future, the feature selection method should be improved to take this imbalance into account.