Skill-Agnostic analysis of reflection high-energy electron diffraction patterns for Si(111) surface superstructures using machine learning

ABSTRACT Reflection high-energy electron diffraction (RHEED) data are important for the in-situ characterization of surface conditions during physical vapor deposition. Surface superstructures obtained by adsorbing exotic atoms onto a clean silicon surface, which exhibit various physical properties, were identified using RHEED. However, this information is too abundant for quantitative analysis; therefore, scientists rely on their expertise to interpret RHEED patterns to assess surface structures and evaluate film thickness, and a large amount of information remains unused. In this study, we adapted machine learning for a RHEED pattern dataset of a clean Si(111) surface during indium deposition in molecular-beam epitaxy growth to use the entire RHEED pattern image information and investigated appropriate machine leaning analysis methods. First, we aimed to determine RHEED pattern similarities in the dataset. Then, five structural phases, 7 × 7 (clean surface), √3×√3, √31×√31, 4 × 1, and 4 × 1 (Room Temperature), were automatically detected by hierarchical clustering using Ward’s method. Next, we aimed to extract the information for each surface superstructure from the dataset. Using non-negative matrix factorization, we successfully estimated the optimal forming conditions for each surface superstructure separately more accurately than the conventional methods. Our proposed strategies could be widely applied to surface structural analysis. GRAPHICAL ABSTRACT


Introduction
Clean semiconductor surfaces with well-controlled atomic structures provide attractive stages for various low-dimensional systems such as atomiclayer materials [1][2][3][4], heteroepitaxial multi-layers [5][6][7], topological materials [8][9][10], and Rashba films [11][12][13]. These surfaces and thin films are fabricated by physical vapor deposition (PVD). Reflection high-energy electron diffraction (RHEED) is a widely used technique for structural analysis of surfaces obtained by PVD growth. RHEED enables researchers to observe surface structures in situ during film fabrication because the deposition rate can be evaluated by monitoring the dynamic changes in the diffraction patterns and intensity oscillations of the selected diffraction spots. Thus, RHEED has been used in the process optimization for sample preparation. However, the intensity of the diffraction spot is affected by the background, neighboring spots, incident electron beam intensity fluctuations, and sample temperature, among other factors. Moreover, expert knowledge is required to interpret diffraction patterns. RHEED images contain a lot of surface information, including surface atomic structural and film quality information, such as step density, growth mode, and strain [14][15][16][17]. Therefore, an appropriate data processing technique is required to extract the necessary information and perform a high-accuracy analysis.
In this study, we propose a skill-agnostic analysis method for RHEED images. We applied unsupervised machine learning, multi-dimensional scaling (MDS), hierarchical clustering, PCA, independent component analysis (ICA), and non-negative matrix factorization (NMF) to RHEED pattern images of a Si(111) surface during indium deposition in molecular-beam epitaxy (MBE) growth. By introducing machine learning, scientists can thoroughly analyze a large amount of information. In particular, unsupervised learning methods are useful in practice for categorizing datasets in which we do not know which element will appear, such as time-series RHEED pattern image datasets taken during parameter sweep. Aiming to perform a skill-agnostic and fast analysis, we applied unsupervised learning to one-dimensional (1D) RHEED image arrays and worked on a highly accurate process optimization for sample preparation, as compared to the conventional thickness monitoring method.
To determine the number of surface superstructural phases included in the dataset, we attempted MDS and hierarchical clustering (3.1). Using Ward's method of hierarchical clustering, we automatically found that five structural phases were present. Subsequently, to extract information on each surface superstructure -namely 7 × 7 (clean surface), √3×√3, √31×√31, 4 × 1, and 4 × 1 (Room Temperature, RT)-PCA, ICA, and NMF were tried (3.2) [37][38][39]. Using NMF, the optimal forming conditions for each surface superstructure were individually estimated with high accuracy. Si surfaces are the most basic and widely studied semiconductor surfaces. Over the past few decades, various superstructures on Si(111) surfaces induced by several adsorbed atoms have been reported [1,2]. Each superstructure shows an amazingly different characteristic; therefore, it is important to identify them and optimize the sample preparation conditions.
The remainder of this paper is structured as follows. Section 2 presents the experimental procedures. Section 3 discusses the automatic detection of surface superstructural phase transitions and optimization of the formation of each surface superstructure along with the several machine learning methods. Finally, section 4 provides a brief summary and outlook.

Sample preparation
The substrate was an n-type (resistivity of 1-10 Ω·cm at 300 K) Si(111) wafer, the size was 3 mm × 20 mm. Si(111) is three-fold symmetric; however, to form single-domain 4 × 1-In chains, vicinal substrate with a miscut of 1.8° along the � 1 � 12 ½ � direction was used. The oxide layer was removed by direct current (DC) heating to 1200 ℃ for a few seconds in an ultra-high vacuum (UHV) chamber (1 × 10 −8 Pa). Then the substrate was annealed at 1050 ℃ to produce a Si(111)-7 × 7 clean surface terrace with one atom high steps. Indium was deposited onto this 7 × 7 surface from an aluminacoated tungsten wire basket at 450 ℃ until approximately a 1.0 monolayer (ML) indium chains were formed along the steps. Here, 1.0 ML corresponds to 7.84 × 10 14 atoms/cm 2 for Si(111).

RHEED image dataset
We started the DC sample heating and RHEED pattern recording and then opened the shutter of the deposition source to start the atomic layer growth after the sample temperature stabilized. The acceleration voltage for RHEED was 15 keV. The angle of incidence and beam current of RHEED were 3.0° and 36 µA, respectively. The shutter was closed to finish indium deposition after 392 s and DC sample heating was stopped after 421 s from the start of deposition. Finally, the RHEED pattern recording was stopped after the sample had cooled to room temperature. A total of 375 RHEED pattern images ([11 � 2] incident) were continuously taken for 480 s using a charge-coupled device (CCD) camera to monitor the structural changes caused by the surface phase transition due to the change in the indium deposition amount. The average interval between shots was 1.28 s, and the size of RHEED pattern images was 480 × 640 (h×w) pixels.
We used 208,251 pixels within a fluorescent screen, as shown in Figure 1 (red circle), from the 307,200 (480 × 640) pixels in a rectangular image from the CCD camera. The luminance of each pixel was obtained and the output was expressed using a 1D array.

Machine learning analysis
Machine learning analysis was attempted for 374 images after excluding one image with a shifted view; therefore, 374 1D luminance arrays were used as the dataset. We adopted MDS, hierarchical clustering, PCA, ICA, and NMF for RHEED image analysis. Machine learning was performed using the Scikit-learn packages for MDS, PCA, ICA, and NMF, and the SciPy package for hierarchical clustering. Unless specified, the default parameters were used. Data analysis was performed using the following machine learning software and libraries: Python (Version: 3.7.6), Scikit-learn (Version: 0.22.2), and SciPy (Version: 1.4.1). The computer used for the calculations was equipped with an Intel(R) Core(TM) i5 CPU with four cores at 1.60 GHz and 16 GB memory.

Surface superstructure transition
The Si(111) clean surface shows a structural phase transition according to the amount of indium adsorption [1]; therefore, the RHEED pattern image dataset taken during the deposition process includes information on several surface superstructural phases. To detect the number of surface superstructural phases, we applied MDS and clustering, which are general methods for confirming sample similarity in datasets. First, we performed MDS, which can express the overall relationships among all samples in two-dimensional mapping, to visualize whether clusters corresponding to each surface structure appear in the mapping. Figure 2 shows the MDS results using a six-point distance scale: 'Euclidean distance,' 'Manhattan distance,' 'Chebyshev distance,' 'Squared Euclidean distance,' 'Cosine distance,' and 'Hamming distance.' In Figure 2, the color of the plot changes from blue to light green over time. From Figure 2, the number of phases and phase transitions during indium deposition do not appear clearly. However, in Figure 2(a-e), the light green markers form a cluster, and the other markers are arranged in a continuous manner. The latter distribution occurs because the spot size of the electron gun in the RHEED measurement system was large (100-200 μm) compared to the crystal nuclei and domains, and the structural phase transitions in the RHEED patterns under the deposition process seems to occur continuously. The former distribution (i.e. the light green cluster) includes the RHEED images taken after the DC sample heating was stopped. At that moment, the intensity of the RHEED spots increased discontinuously with sample cooling. On the other hand, in Figure 2(f), with Hamming distance, the markers show a spiral-like shape, and the time-dependent change tendency in the mapping plot cannot be interpreted. This means that the Hamming distance is not suitable for our dataset.
Next, we performed a clustering analysis, specifically, hierarchical clustering. Non-hierarchical clustering, such as the k-means clustering, has often been used as a typical clustering analysis method in previous studies [34][35][36]; however, the number of clusters must be determined prior to the analysis by  scientists. Furthermore, the k-means clustering is vulnerable to unbalanced cluster sizes. Therefore, we applied hierarchical clustering to the RHEED image dataset. In hierarchical clustering analysis, the relationship between samples is shown in a hierarchical structure called dendrogram, which helps estimate the number of clusters visually. To perform hierarchical clustering, it is necessary to select the routine for combining clusters and the distance scale among samples.  Considering the MDS principle, in which similar samples are plotted closer and dissimilar samples are plotted farther in the selected distance scale, it is expected that the distance scales that capture the features of RHEED pattern changes in MDS mapping are appropriate for performing the hierarchical clustering. From Figure 2(a-f), the behavior using the Euclidean and Squared Euclidean distances show a particularly reasonable behavior. We adopted Euclidean distance, which is highly versatile for cluster-combining methods, and used seven routines for combining clusters, namely the single linkage, complete linkage, group average, weighted average, centroid, and median methods, as well as Ward's method. Figure 3(a) shows the dendrogram results and Figure 3(b) shows the temporal cluster change at the farthest distance in the hierarchy for the single linkage method, group average method, and Ward's method. The results using the complete linkage, weighted average, centroid, and median methods are shown in Figure S1 in the Supplementary Material. The number of clusters for each routine is summarized in Table 1. Although we succeeded in detecting the dynamical change in diffraction patterns using all linkage methods, as shown in Figure 3 and S1, the number of   clusters was different, as shown in Table 1. However, each cluster does not necessarily correspond to a specific surface superstructure. Ward's method shows the largest number of clusters, five is the optimum, as shown in Table 1. The average images of each cluster are shown in Figure 4. In Figure 4, Clusters 2-5 represent four completely different patterns, while Cluster 1 represents the pattern in which the intensity discontinuously increased by stopping the DC sample heating. Compared with the phase diagram ( Figure 5 [1 However, the other routines did not work well; for example, four structures were recognized during heating as a single cluster when a single linkage was used. Ward's method combines clusters together in such a manner that the increase in the square of the distance from the center of gravity due to the combining clusters is minimized. Although computationally expensive, Ward's method is highly sensitive, and therefore, in our case, it is considered to be the most successful in categorization. From these results, Ward's method was determined as the most appropriate for hierarchical clustering for RHEED pattern change detection. As shown in Figure 2, the dynamic change in the diffraction patterns due to the structural phase transition driven by the amount of adatom deposition occurs continuously. Therefore, cluster boundary detection is inherently difficult. Although Ward's method is based on hard clustering, our results suggest that Ward's method is a useful means to easily understand surface superstructural phase transitions.

Optimal deposition time estimation
In the previous section, automatic RHEED pattern detection was achieved via machine learning without prior knowledge or experience. Figure 6 presents the information on the phase transition obtained from the Ward's method in section 3.1, together with the experimental procedure. Although the hierarchical clustering is hard clustering, the actual phase transitions occurred in a continuous manner; therefore, a conceptual diagram of the phase transitions is shown in Figure 6. Next, we searched the image when the pattern of each structure most clearly appeared, that is, the condition in which each structure occupied the largest area of the surface. This enabled us to determine the optimal deposition time for fabricating each surface superstructure separately. Searching for the best conditions for synthesizing target materials using machine learning is one of the goals of materials informatics.
Scientists have generally identified surface superstructures based on their expertise. In addition, they have optimized the deposition time to fabricate a certain surface superstructure by observing timedependent intensity profile of RHEED spots from the aimed surface superstructure relying on their technique. This method is also used for thickness estimation using the intensity oscillation of RHEED spots. Typically, the intensity profile is detected at only one arbitrary RHHED spot. To obtain the intensity profile, a rectangular area around one arbitrary RHHED spot is defined; then, the luminance sum in that range is calculated. However, the accuracy of this method is poor because spots derived from unintended superstructures could exist in the selected rectangular area, and the effect of the background intensity, such as that from Kikuchi patterns, is inevitable. For example, in Figure 7, we attempted to estimate the optimal deposition time required to obtain a Si(111)4 × 1-In structure using the above mentioned conventional method. The RHEED spots in the red and blue rectangular areas are both derived from the 4 × 1 structure (Figure 7(a)). As shown in Figure 7(b), in the case of the red rectangular area, the intensity peak clearly appeared at 290 s, which can be determined as the optimal deposition time. However, in the case of the blue rectangular area, we cannot determine the optimal deposition time. This means that the position and size of the rectangular area significantly affect the results. Moreover, the results of this conventional method depend on the analyst's expertise. Accordingly, by using machine learning, we expect a more accurate estimation of the best condition using information from the entire RHEED pattern image dataset.
Therefore, we aimed to extract the components that reflect the information of each surface superstructure via dimensional reduction. First, we applied PCA, which is a typical dimensional reduction method and has been used in previously [34][35][36]. Figure 8 shows the cumulative contribution ratio of the PCA results. It was found that the number of principal components could be reduced to four when more than 90% of the information in the dataset was retained. Figure 9 shows the basis pattern images of the four components and their score change over time. A higher component score indicates that the RHEED pattern contains the principal component more strongly at that time. In Figure 9(a), the basis patterns differ from those in Figure 4, which correspond to the diffraction patterns of specific surface superstructures obtained by hierarchical clustering using Ward's method. The behavior of the component score change over time in Figure 9(b) is too complicated to interpret easily. Therefore, PCA is considered unsuitable for estimating the optimal deposition time for fabricating a specific surface superstructure. In PCA, the components are orthogonal to each other, but the RHEED patterns do not have orthogonal relationships, so PCA does not work well. Therefore, it is necessary to use a dimensional reduction method that extracts information under non-orthogonal constraints.
We also applied ICA, which is a dimensional reduction method under the constraint that the bases are independent. We assumed the number of independent components to be five because the number of RHEED patterns detected by hierarchical clustering was five. Figure 10 shows the images of the independent components and their score change over time. As with PCA, a higher component score Standardized total luminance in the two rectangles. In the red rectangle, the 4×1 structure is well observed; however, in the blue rectangle, the intensity change of the 4×1 structure is not observed.
indicates that the RHEED pattern contains the independent component more strongly at that time. Similar to PCA, the basis pattern images of each component in Figure 10(a) cannot be assigned to the typical diffraction patterns of specific surface superstructures, and the score change over time in Figure 10(b) is also complicated.
The reason for the incompatibility of PCA and ICA for dimensional reduction is that positive and negative values are included in the components; therefore, diffraction patterns from different structures appear depending on the values (positive or negative). To estimate the optimal vapor deposition time for fabricating a specific surface superstructure, it is desirable to extract non-negative components. Therefore, we applied NMF, which is a dimensional reduction method with the constraint that only nonnegative matrices can be decomposed. We performed NMF with number of basis k = 5-10. Figure 11 shows the difference between the images reconstructed using the basis patterns and the original images, that is, the reconstruction error. From the basis patterns with number of basis k = 5-10 shown in Figures  S2-S6 and Figure 12(a), it is evident that when the number of basis is less than 7, each structure cannot be separated, whereas when the basis number is greater than 7, the structures are too divided to be accurately interpreted. Therefore, basis number 7 appears to be valid. It should be noted that the number of basis patterns in NMF and clusters obtained via hierarchical clustering using Ward's method are different. The appropriate number of components for performing NMF should be determined by confirming each component, in addition to the reconstruction error. Figure 12 shows the NMF results with number of basis k = 7. Figure 12(a) shows the seven basis patterns obtained via dimensional reduction using NMF. Compared to the hierarchical clustering results using Ward's method, each pattern in NMF seems to correspond to the background (Basis Pattern 1), 7 × 7 structure divided into two basis(Basis Pattern 2,3), √3×√3 structure(Basis Pattern 4), √31×√31 structure(Basis Pattern 5), 4 × 1 structure(Basis Pattern 6), and 4 × 1 (RT) structure after stopping the DC sample heating  (Basis Pattern 7). Of the two 7 × 7 patterns(Basis Pattern 2,3), Basis Pattern 3 is suggested to be δ7 × 7 [40]. This means that the RHEED pattern images are decomposed into the background and five structures using NMF. As in section 3.1, the 4 × 1 structure(Basis Pattern 6) and 4 × 1 (RT) after the DC sample heating was stopped (Basis Pattern 7) are also separated in NMF. In Basis Pattern 7, a rapid coefficient change over time occurred when the heating of the sample was stopped. The intensity of the RHEED pattern decreases with increasing substrate temperature due to the effect of lattice thermal vibrations. In other words, the coefficient change in Basis Pattern 7 may reflect the effect of lattice vibration as the substrate temperature changes.
Let us focus on the time-dependent change in the coefficients of each basis pattern (Figure 12 (b)). Similar to PCA and ICA, the higher the coefficient value, the more strongly the RHEED pattern contained the basis pattern in the NMF analysis. Even though the 7 × 7 structure is divided into two basis patterns, Basis Patterns 2 and 3, Basis Patten 3 oscillates when Basis Pattern 2 decreases fast, resulting in an overall decrease. This is consistent with the expected surface superstructural transition, in which the Si(111)-7 × 7 clean surface changes into the next surface superstructure due to an increase in indium adsorption over time. Basis Pattern 4 in Figure 12 corresponds to the √3×√3 structure basis pattern and coefficients. Its maximum value was obtained at 129 s, so the optimal deposition time of indium for the √3×√3 structure was found to be 129 s. Accordingly, the optimal deposition times of indium for the √31×√31 and 4 × 1 structures are found to be 184 and 292 s via Basis Patterns 5 and 6, respectively, in Figure 12. Lastly, Basis Pattern 7 in Figure 12 shows the 4 × 1 (RT) structure after stopping the DC sample heating and the coefficients start to increase sharply at 421 s, when the DC sample heating was stopped. This behavior is consistent with the experimental procedure. From these NMF results, it is considered that NMF is a suitable method for decomposing the RHEED pattern into the corresponding main factors and estimating the optimal deposition time for fabricating each surface superstructure. Table 2 shows an estimated deposition time comparison between the NMF and conventional methods, the latter is shown in Figure 7. The NMF method has the potential to estimate the optimal deposition time automatically without prior knowledge regarding diffraction patterns and is more accurate than the conventional method. Figure 13 shows the optimal deposition time estimated by NMF method for each structural phase. The structures √3×√3 and 4 × 1 are composed of 1/3 ML and 1 ML of indium surface density, respectively [37]. Nevertheless, the estimated time for 4 × 1 was not 3 times greater than that for √3×√3. This may be related not only to the stability of deposition source, but also to the desorption and the diffusion of indium atoms. Film thickness gauges do not necessarily reflect the accurate adsorption amount of deposited atoms. Therefore, in-situ structural monitoring by RHEED and machine learning analysis of diffraction patterns are a powerful combination to estimate the optimal deposition conditions with high accuracy.

Conclusions
Herein, we propose an efficient method to analyze the entire RHEED pattern image dataset using machine learning for indium deposition on Si(111) surface superstructures. MDS, hierarchical clustering, PCA, ICA, and NMF were applied.