Automatic labeling of certain steel microstructural constituents with image processing and machine learning tools

ABSTRACT It is demonstrated that optical microscopy images of steel materials can be effectively categorized into preset classes of ferrite/pearlite-, ferrite/pearlite/bainite-, and bainite/martensite-type microstructures with image pre-processing and statistical analysis, including machine learning techniques. Although several popular classifiers achieved reasonable class-labeling accuracy, random forest was virtually the best choice in terms of overall performance and usability. The present categorizing classifier could assist in choosing, for various steel microstructures, the appropriate pattern recognition method from the library we have recently reported. That is, the combination of the categorizing and pattern-recognizing methods provides a total solution for the automatic quantification of a wide range of steel microstructures.


Introduction
Microstructure phase analysis is one of the primary interests in metallurgy because the formed microstructures significantly determine the material properties [1][2][3]. Most of the data for such studies come from optical or electron microscopy imaging techniques. With modern industrial optical microscopes featuring advanced automated imaging equipment, scanning stages, and even sample slicing for three-dimensional imaging, the amount of image data becomes overwhelming for manual examination, and, in some cases, the relevant information could remain hidden even from an expert. In this respect, there are high hopes that machine learning (ML) techniques can assist in the automation of routine tasks on big image datasets and that the resulting information gains could even unveil new material paradigms. For metallurgy, progress in this direction is of paramount importance. So far, however, the reported image analyses with ML on metallic materials have been extremely limited and, in most cases, were not compared/matched to the standard analytical methods of the metallurgical industry [4]. This situation is in strong contrast with the explosive number of scientific publications on ML applications to materials science problems using physicochemical and structural property datasets/databases [5][6][7][8][9] as well as other knowledge sources [10,11].
Consequently, we recently reported the successful application of random forest (RF) ML classifiers to segment optical microscopy images of typical steel materials into ferrite (F), pearlite (P), bainite (B), and martensite (M) microstructures, with possibilities for further segmentation/analysis of their corresponding sub-phases [4]. The key point of this technique was the excellent quality of the quantitative results, which was comparable to or even better than manual estimations by experts, not to mention the benefits of microstructure phase-area visualization, analysis speed boosts, and automation with computers.
As with any single technique, there were also some limitations to our previously established RF classifiers. One of them was that the number and type of microstructure phases/classes (j = A, B, C, ..., n) present in the training image dataset must correspond to the same set of j in the dataset being analyzed in order to get correct results with a specific RF classifier. This poses less of a problem for an expert who recognizes well the types of j in the images to be segmented, but for large-scale use in an industrial analytical laboratory, or for use by a non-expert in steel/metal microstructures, this limitation should be properly addressed. The prime aim of the current work is to develop additional image processing/analysis protocols to assist in choosing the appropriate RF classifier for the image of interest. Figure 1 shows the envisioned general workflow of image analysis for metallic materials with ML and image processing tools. The current analysis module and its contribution to the workflow are highlighted in orange. For a human expert in the metallurgy field, it is often enough to have just a brief look at a microscopy image of a steel material to judge what types of microstructures, e.g. F/P-, F/P/B-, or B/M-type ones, the image shows. Doing this on a computer in an automated way could bring great benefits, for example, in the analysis of a welded part of steels, which can contain all of these microstructure types in different areas. The high-resolution image data of a whole welded part could comprise thousands of images, which hold a lot of useful information but are difficult/impractical to sort and analyze manually. This is our first attempt to address this problem.

Experimental details
All samples of A-type steel (see composition in Figure 2), obtained by cooling this alloy from 1400°C at rates of 0.3, 1, 3, and 10°C/s, were prepared by conventional polishing and subsequent etching with 0.5% picric and 0.5% nitric acids in ethanol [12]. The microstructures in our samples correspond to the typical ones appearing in a welded part of steels. They were imaged at the same spatial resolution on a BX53M optical microscope (Olympus, Japan) equipped with an MPLN50x objective (Olympus, Japan) and a DP22 CCD camera (Olympus, Japan), which satisfied the pixel-size requirement for maximum optical resolution (see grayscale image in Figure 2). The samples cooled at 0.3, 1, 3, and 10°C/s were identified as (F_gb+F_all+F_sp)/P, (F_gb+F_all+F_sp)/P, (F_all+F_sp)/P/B, and B/M-type microstructures, respectively, with different contributions of the ferrite sub-phases: allotriomorphic (F_all), grain boundary polygonal (F_gb), and side plate (F_sp) ferrites.
Image pre-processing, analysis, and segmentation were conducted using the open-source FIJI software package with the Trainable Weka Segmentation plugin on a two-CPU Opteron 6128 workstation with 128 GB RAM [13][14][15][16][17]. The ML on training/test datasets with statistical image data, for labeling images by the microstructures present, was performed using the open-source WEKA software package, which is a collection of ML algorithms for data mining tasks [18].

Results and discussion
Before discussing image labeling, Figure 2 shows an example of applying the RF ML algorithm: it was trained on four optical microscopy images of A-type steel and used to segment them into the F and P microstructure phases, which are formed by cooling from 1400°C at a 1°C/s rate. The successful automatic pattern recognition with such a well-trained RF classifier was confirmed by an expert through manual visual inspection and line-intercept analysis (see percentage numbers in Figure 2). In principle, such an RF classifier could be applied to any number of other images containing F and P microstructures, provided those images have similar quality and the same spatial resolution.

Figure 2. Example of the image segmentation protocol for F and P microstructure phases using the image pre-processing and random forest classifier tools. The color coding of the F and P microstructure areas of A-type steel is given in the insert together with the results of quantitative manual and automated analyses.
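Such pixel-wise RF segmentation can be sketched in a few lines. The following is a minimal, hypothetical Python analogue of the Trainable-Weka-style workflow (scikit-learn and SciPy instead of FIJI; the "micrograph", feature stack, and expert-labeled patches are synthetic stand-ins, not the authors' data or exact feature set):

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.ensemble import RandomForestClassifier

def pixel_features(img, sigmas=(1, 2, 4)):
    """Stack raw intensity with Gaussian-smoothed copies at several scales:
    one feature vector per pixel (a crude analogue of the Trainable Weka
    Segmentation feature stack)."""
    feats = [img] + [gaussian_filter(img, s) for s in sigmas]
    return np.stack(feats, axis=-1).reshape(-1, len(feats))

rng = np.random.default_rng(0)
# Synthetic micrograph: bright "ferrite" matrix with a dark "pearlite" blob.
img = 0.8 + 0.05 * rng.standard_normal((64, 64))
img[20:44, 20:44] = 0.3 + 0.05 * rng.standard_normal((24, 24))

labels = np.full(img.shape, -1)   # -1 = unlabeled pixel
labels[:8, :8] = 0                # expert-marked ferrite patch
labels[28:36, 28:36] = 1          # expert-marked pearlite patch

X = pixel_features(img)
y = labels.ravel()
mask = y >= 0                     # train only on expert-labeled pixels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[mask], y[mask])
seg = clf.predict(X).reshape(img.shape)   # full-image segmentation
pearlite_frac = 100.0 * (seg == 1).mean() # phase-area percentage
```

The phase-area fraction computed from `seg` corresponds to the percentage numbers reported in Figure 2; in practice, the trained classifier would be reused on further images of the same quality and resolution.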
However, as discussed in Section 1, the required prior knowledge of steel microstructures (i.e. the type of microstructures in the image of interest) could make it problematic to choose the RF classifier correctly. As one possibility to solve this problem, we extracted statistical attributes from image datasets for various cooling rates and tried to classify/label each image into F/P-, F/P/B-, and B/M-type microstructures with ML tools. Figure 3 shows the protocol of image processing and subsequent attribute extraction: (1) automatic conversion of the color image stacks to 8-bit gray-level ones for each A-steel cooling rate; (2) consistent automatic optimization of the brightness and contrast of the resulting image stacks based on analysis of the image histograms; (3) automatic conversion of the resulting image stacks to binary (black and white) ones by analyzing the image histograms and subsequent thresholding; and (4) automatic counting and measurement of the black objects/particles in the stacks of resulting binary images. Table 1 lists the names and descriptions of the absolute or mean values of the estimated attributes for each image (one data point) in the resulting training and test datasets. In total, there were 117 training and 73 test data points/images in the 21-dimensional attribute space. They were used to train and test different ML classifiers for classification into the four types of microstructures. Note that some attributes in Table 1 were completely irrelevant to our classification problem, redundant, or correlated with others. They were left in on purpose and used to check the attribute selection/reduction tools and the robustness of different classifiers to such datasets. This behavior is important to know in case new attributes become necessary or available in the future.
Some differences between the training and test datasets were also introduced to model possible real-world applications: (1) the samples for the test dataset were prepared some time apart from those for the training dataset, but with the same composition and cooling rates; (2) the test data for the 3°C/s cooling rate were extracted from images with poor sample etching for B-phase visualization, so mainly the F and P phases were visible in those images. Such data were used on purpose to check the sensitivity of the attributes and classifiers to data-quality variations, which could typically be encountered in actual experiments. Figure 4 shows examples of 3D and 2D slices of this 21-dimensional attribute space. In spite of considerable scattering, there is a visible tendency for the data points in the training and test datasets to group depending on the type of microstructure combination. It is also seen in the 2D slice of surface density vs. aspect ratio (plotted on the gray plane) that the largest deviations between the training (dark green) and test (light green) data are for the 3°C/s cooling rate, which is reasonable given the poor etching quality mentioned above. This can also be observed in the 3D/2D slices for other attributes (see Supplemental Material).

Figure 4. The 3D/2D slices of the multidimensional attribute space for the training and test datasets of image statistics. Each attribute data point corresponds to the absolute or mean value for a single image (see Figure 3 and Table 1). A mapping example of 3D points to the 2D plane is shown for clarity with blue lines. Additional 2D slices in the form of a partial scatter matrix are available as Supplemental Material.

We then applied several ML techniques to train classifiers and validate the resulting classification accuracy. For comparison, Table 2 lists the methods used and the results of 10-fold cross-validation on the training dataset, which was utilized to build the classifiers. In the next step, these classifiers were applied to the test dataset for additional validation. The corresponding results are also listed in Table 2.
In addition to RF, multilayer perceptron (MP), and other individual classifiers, we also used the state-of-the-art Auto-WEKA, an RF-based Bayesian optimization module for tuning the hyperparameters of the classification algorithms built into the WEKA software package [19,20]. It works with 27 base ML algorithms, 10 meta- and 2 ensemble-methods, combining 3 search and 8 evaluator techniques. In total, 789 hyperparameters from all the classification algorithms and feature selectors/evaluators could be used in the optimization, depending on the targeted accuracy, time, and PC resources. Table 2 lists several Auto-WEKA runs with the names of the classifiers and main arguments that were selected within the given computation time. Note that Auto-WEKA generates a new classifier with optimized hyperparameters, which can be conveniently applied to the test datasets in the ordinary WEKA way. Auto-WEKA deals with a hyper-dimensional surface that has many local minima; therefore, classifiers with different or identical parameters may be created depending on the number and settings of the Auto-WEKA runs. To our knowledge, this is the first application of the Auto-WEKA technique to materials science problems.
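The effect of such hyperparameter search can be mimicked on a small scale. The sketch below uses a plain random search over an RF hyperparameter space (scikit-learn's `RandomizedSearchCV`); note this is a simpler stand-in, not Auto-WEKA's model-based Bayesian optimization over many algorithms at once, and the data are synthetic:

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(3)
# Synthetic 4-class, 21-attribute data standing in for the training set.
X = np.vstack([rng.normal(k, 1.0, size=(30, 21)) for k in range(4)])
y = np.repeat(np.arange(4), 30)

# Random search over a small RF hyperparameter space; Auto-WEKA instead
# optimizes hundreds of hyperparameters across its whole algorithm library.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(3, 200),
        "max_features": ["sqrt", "log2", None],
        "min_samples_leaf": randint(1, 5),
    },
    n_iter=10, cv=5, random_state=0,
)
search.fit(X, y)
best = search.best_params_   # tuned hyperparameters, reusable like any model
```

As with Auto-WEKA, repeated runs with different seeds or budgets may return different "best" configurations of similar accuracy.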
From Table 2, it can be seen that RF, MP, and Auto-WEKA easily achieve 96.6-100% accuracy with 10-fold cross-validation on our training dataset. Note that the diagonal and off-diagonal elements of the confusion matrix (CM) correspond to the numbers of correct and incorrect classifications, respectively (see Equation (1)). All other errors and statistical values can easily be calculated from such matrices:

\mathrm{CM} = \begin{pmatrix} aa & ab & ac & ad \\ ba & bb & bc & bd \\ ca & cb & cc & cd \\ da & db & dc & dd \end{pmatrix}   (1)

\mathrm{FNR}_a = \frac{FN_a}{P_a} = \frac{ba + ca + da}{aa + ba + ca + da}   (2)

\mathrm{TNR}_b = \frac{TN_b}{N_b} = \frac{TN_b}{FP_b + TN_b}   (3)

In Equation (2), TP_a = aa, FN_a = ba + ca + da, P_a = TP_a + FN_a, and FNR_a are the true-positive (hit), false-negative (miss), condition-positive, and false-negative-rate values for class a, respectively, with FNR_a being the proportion of positives that yield negative test outcomes. In Equation (3), FP_b, N_b = FP_b + TN_b, TN_b, and TNR_b are the false-positive (false alarm), real-negative, true-negative (correct prediction of not belonging to the class), and true-negative-rate values for class b, respectively, with TNR_b being the proportion of actual negatives that are correctly identified. The reported per-class quality measure is the average of the PPV (positive predictive value) and TPR (true positive rate) for classes a, b, c, and d, respectively.

Table 2. Performance comparisons between random forest, neural network, and Auto-WEKA classifiers. The a, b, c, and d column/row labels correspond to the instances classified at 0.3, 1, 3, and 10°C/s, or as (F_gb+F_all+F_sp)/P, (F_gb+F_all+F_sp)/P, (F_all+F_sp)/P/B, and B/M classes, respectively. The numbers in brackets are calculated by omitting the data at 0.3°C/s. All attributes from Table 1 were used.
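The per-class rates of Equations (2) and (3) follow directly from any square confusion matrix. A small generic sketch (the example matrix is illustrative only, not the Table 2 values; rows are taken as actual classes and columns as predicted, and swapping that convention simply exchanges the FN and FP sums):

```python
import numpy as np

def class_rates(cm):
    """Per-class FNR and TNR from a square confusion matrix whose diagonal
    holds the correct classifications (rows: actual, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correct classifications per class
    fn = cm.sum(axis=1) - tp         # actual class, predicted elsewhere
    fp = cm.sum(axis=0) - tp         # other classes predicted as this one
    tn = cm.sum() - tp - fn - fp     # everything not involving this class
    fnr = fn / (tp + fn)             # Equation (2): FN / P
    tnr = tn / (fp + tn)             # Equation (3): TN / N
    return fnr, tnr

# Illustrative 4-class matrix for classes a, b, c, d.
cm = [[28, 2, 0, 0],
      [1, 29, 0, 0],
      [0, 3, 26, 1],
      [0, 0, 0, 30]]
fnr, tnr = class_rates(cm)
```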
Although the Auto-WEKA classifiers slightly outperformed the RF and MP ones on the training dataset, applying these trained classifiers to the test dataset dropped the prediction accuracy significantly and differently for all of them, with the smallest decrease for RF (see the last column in Table 2). Below, we demonstrate in more detail the overall superiority of the RF classifier for our microstructure-labeling problem. Note that the increased misclassification on the test dataset was mainly due to the instances at the 3°C/s cooling rate. Actually, this was a reasonable and even welcome result, since these data were obtained from images with insufficient etching to visualize the B microstructures. The classifiers then mainly assigned these test data to the F/P-type microstructure (b-column in the CM), which indicates that our method spotted this problem automatically. The small misclassification between the data at the 0.3 and 1°C/s cooling rates was also understandable, since both samples have F/P-type microstructures and differ only in their F sub-microstructures, i.e. in the relative contributions of F_all, F_gb, and F_sp. In total, however, our method can distinguish such F/P microstructures at 0.3 and 1°C/s cooling rates reasonably well. In principle, more precise labeling of the relative contribution of a particular F sub-microstructure could be feasible with some attributes from the Euclidean distance conversion technique (see, for example, [4]). If the 3°C/s data are removed from the test dataset, the classification accuracies improve significantly for all classifiers (see the values in parentheses in Table 2). Comparing these values, it can be noted that the classifiers with the RF algorithm at their core produced the most accurate predictions.
Actually, this was not a surprise. It was reported that, when 179 classifiers were applied to 121 datasets from the UCI database, versions of the RF classifier produced the best results in most cases [21]. A similar conclusion was also derived from a smaller-scale investigation with 65 WEKA classifiers on 3 datasets [22]. In our case, the basic RF classifiers even slightly outperformed the Auto-WEKA classifier with RF at its core (98.6 vs. 97.1%). This was probably due to only three trees being used in the RF with Auto-WEKA, and to the better internal attribute selection of the base RF algorithm compared with the greedy stepwise and feature subset selection techniques applied to the training dataset prior to RF classifier creation in Auto-WEKA. Figure 5(a) demonstrates this internal attribute selection in the process of basic RF classifier buildup by plotting the normalized attribute counts from all RF trees. Figure 5(a) also shows that there was little difference between 100 and 1000 trees in the forests, which relates to the well-known leveling-off of classification accuracy for forests with more than ~100-200 trees [23,24] and the better robustness of the RF classifier with respect to noise [25]. The attributes irrelevant to our classification task, such as Mode, Mean, FeretX, and FeretY, got zero or very low counts in the tree nodes due to their insignificant statistical effect on the information gains during tree growth. Figure 5(a) also demonstrates that the RF classifier is not a 'black box', since the attribute importance can be extracted and each tree has a very clear meaning (see Figure 5(b)). The base RF classifier can also be created much faster than the Auto-WEKA one. Therefore, automated hyperparameter optimizations still require careful comparisons. Nevertheless, and apart from the required PC time and resources, all classifiers listed in Table 2 can produce quite good results with ~86-99% accuracy (see the values in parentheses).
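The node-count statistics of Figure 5(a) have a close analogue in scikit-learn's impurity-based `feature_importances_`. A hedged sketch on synthetic data, where only the first 5 of 21 attributes are informative and the remaining 16 mimic irrelevant attributes such as Mode, Mean, FeretX, and FeretY:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
# 4-class data where only the first 5 of 21 attributes carry class signal.
n_per_class = 30
informative = np.vstack([rng.normal(k, 0.7, size=(n_per_class, 5))
                         for k in range(4)])
noise = rng.normal(0, 1, size=(4 * n_per_class, 16))   # irrelevant attributes
X = np.hstack([informative, noise])
y = np.repeat(np.arange(4), n_per_class)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
imp = clf.feature_importances_   # analogue of the node-count statistics
```

Irrelevant attributes receive near-zero importance because they rarely yield an information gain at a split, mirroring the zero or very low node counts seen in Figure 5(a).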
From a practical viewpoint, it is better to deal with attributes in datasets that are robust to the image scale/area. Otherwise, the images for testing should be collected with the same spatial resolution, or additional normalization is necessary. Such robust attributes are Surface Density, Aspect Ratio, Circularity, and Solidity. However, this manual attribute reduction could lead to a decrease in classification accuracy. Table 3 shows the performance of several well-known ML techniques with such reduced training/test datasets. Again, the best classification accuracy was obtained with the RF classifier. In this case, the drop in accuracy was only ~1.5% compared to the use of all 21 attributes. Figure 6 demonstrates the differences between the classifiers in terms of their decision boundaries, which were estimated with smooth interpolations. As indicated by the labels, each class maximum probability is assigned a single color code. The RGB pixel color in the discretized attribute space is then defined by a linear combination of the estimated class probabilities (weighted averages based on kernel density estimators), which are calculated by sampling points in the corresponding attribute space with the classification models [26]. Except for the k-means and k-nearest-neighbors classifiers, with their over-fragmented boundaries, the other tested classifiers are more similar in this respect. Nevertheless, the RF classifier decision boundaries automatically partitioned the relevant attribute space in the simplest way, yet in accordance with our intuitively expected distribution of such boundaries for the different steel types.
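The cost of restricting the classifier to a scale-robust attribute subset can be sketched as below. The data are again synthetic; the first four columns merely play the role of the dimensionless attributes (Surface Density, Aspect Ratio, Circularity, Solidity), so the measured accuracy drop is illustrative, not the paper's ~1.5% figure:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
# 21 synthetic attributes; columns 0-3 stand in for the scale-robust ones.
X = np.vstack([rng.normal(k, 0.7, size=(30, 21)) for k in range(4)])
y = np.repeat(np.arange(4), 30)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
acc_all = cross_val_score(rf, X, y, cv=10).mean()             # all attributes
acc_reduced = cross_val_score(rf, X[:, :4], y, cv=10).mean()  # robust subset
drop = acc_all - acc_reduced   # expected: small accuracy penalty
```

In exchange for this small penalty, the reduced attribute set removes the requirement that every analyzed image share the training images' spatial resolution.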
Here, it should be mentioned that Tables 2 and 3 list the results with the default WEKA settings for all classifiers, since these still produced over 90% labeling accuracy on the training dataset, except for the k-Means one. Note that these settings are based on the known behaviors of the algorithms, i.e. they are not arbitrary and provide a good start. In principle, multiparameter optimizations could be further applied to a single classifier from Table 3 with the MultiSearch meta-method in WEKA, which can optimize an arbitrary number of user-defined parameters and their ranges after attribute selection/filtering with other tools. We did not proceed rigorously with such single-classifier tuning, since the plausible gains would not justify the effort given the already achieved high accuracy with the default settings for the RF algorithm: ~98/99% accuracy on the training/test datasets without any external attribute selection/filtering (see Table 2). Nevertheless, these tools could be useful for other training datasets, and they are available.

Conclusions
We have developed a protocol with image processing and ML tools to label images of steel materials as typical F/P-, F/P/B-, and B/M-type steels. The RF algorithm and its modifications performed better overall than the other ML tools due to their ensemble, unbiased, and stable nature. Our technique of image attribute extraction and subsequent ML application to image datasets could find applications in metallurgical research/analytical laboratories.

Table 3. Performance comparisons between random forest and other popular classifiers. The a, b, c, and d column/row labels correspond to the instances classified at 0.3, 1, 3, and 10°C/s, or as (F_gb+F_all+F_sp)/P, (F_gb+F_all+F_sp)/P, (F_all+F_sp)/P/B, and B/M classes, respectively. The numbers in brackets are calculated by omitting the data at 0.3°C/s. The attributes from Table 1 that are robust to image scale were used.