Generation of synthetic manufacturing datasets for machine learning using discrete-event simulation

ABSTRACT Recent advances in computing power have seen machine learning becoming an area of significant interest in manufacturing for scholars attempting to realise its full potential. Successful machine learning applications require a great amount of specific production data that is neither easily nor publicly accessible. This study aims to develop a framework to use discrete-event simulation (DES) to generate large datasets for training machine learning models. Three DES models were designed and executed to generate synthetic production data for different manufacturing scenarios. Inferences were made on the dependency between the time required to generate data and the complexity of the simulation model. The experimental results show that, as the simulation model becomes incrementally more complex, the time required to generate synthetic data tends to increase. The study revealed that DES is an effective tool for generating high-quality synthetic data which can be fed into machine learning models for training. The datasets generated by the simulations are made publicly available.


Introduction
In order to successfully apply machine learning, there must be sufficient datasets for machine learning model training (Ge et al., 2017; Wuest et al., 2016). For manufacturing problems, the data can either be collected from real-world production systems or from virtual models of the production systems through DES. If field data (or real data) can be collected, researchers can use them to train machine learning models. Otherwise, researchers have to generate synthetic data through DES to overcome the lack of data (Denkena et al., 2014; Lechevalier et al., 2015; Zhang et al., 2018). The synthetic data generated by DES can be used independently or combined with real data for machine learning (ML) model training.
There are many examples of ML applications in manufacturing using real and/or synthetic data. These applications have been reported in a wide range of manufacturing problems such as quality control (Wuest et al., 2014), demand forecasting (Bajari et al., 2015; Lee et al., 2014), condition monitoring (Azadeh et al., 2013; Janssens et al., 2016), job shop scheduling (Shahzad & Mebarki, 2012), lead time prediction (Pfeiffer et al., 2016), supply chain capacity prediction (Silva et al., 2017), and bottleneck analysis (Subramaniyan et al., 2018, 2020). The use of synthetic data has also been reported in a number of production scheduling studies (Kaylani & Atieh, 2016; Kritzinger et al., 2018; Weichert et al., 2019). This paper proposes a step-by-step framework that uses DES to generate synthetic datasets for the training and testing of ML models. The datasets generated are generic and comprehensive so they can be used for multiple manufacturing problems. In this study, three examples of manufacturing systems were simulated, and the performance of the synthetic data generation process for each test study was evaluated. The full datasets are made publicly available through the cloud-based repository Mendeley Data (https://data.mendeley.com/datasets/3rw227zxt7/2), with a view to disseminating them freely for future research in solving manufacturing problems using machine learning.

Existing research work
This section briefly reviews the implementation of DES to generate synthetic data for manufacturing problems and describes a research gap identified in the field of machine learning data generation research.
Because DES can simulate the randomness or uncertainty of events or processes, the generated synthetic data closely resembles the real data. Koh and Saad (2003) proved this by simulating multiple levels of dependent demands, and then generating data on how delayed part delivery and delayed finished product delivery were affected due to the uncertainty that occurred in the production processes. Maas and Standridge (2005) combined real data and simulation generated data for capacity analysis, resource allocation, scheduling, and inventory control. Huang et al. (2013) used DES to analyse the rescheduling problem for a more complex mixed-line production facility. Zhuo et al. (2012) utilised DES to improve space allocation for block assembly. Gyulai et al. (2014) argued that the data generated by simulation can be used as input to the same model for different production schedules. Their results showed that DES can be utilized as a tool to create a robust production and capacity control for flexible assembly lines. Nyemba and Mbohwa (2017) used DES to achieve higher throughput for a multi-product assembly plant and concluded that DES was useful for the company's production planning and scheduling.
In these synthetic data generation studies, the amount of data generated is usually small. Although many applications use DES to generate data, the number of applications that use synthetic data for big data analysis is limited (Greasley & Edwards, 2021). For example, Priore et al. (2006) utilised DES to generate a modest 1000 data points for ML training to solve a dynamic scheduling problem; Shiue (2009) generated only 2000 data points to train the ML algorithm for shop floor control.
Spanning multiple areas of DES, big data, and AI/ML in production research, synthetic data generation plays an increasingly important role that warrants more focused research attention. In data mining, the CRISP-DM (Cross Industry Standard Process for Data Mining) project specifies a comprehensive process model for conducting data mining projects. The process model is independent of both the industry sector and the technology used (Azevedo & Santos, 2008; Schröer et al., 2021; Wirth & Hipp, 2000). Although such a process model is available, most studies reported in the literature do not follow any specific process model. Similarly, our literature review shows that no standard model or framework exists for the synthetic data generation process. Therefore, this study aims to propose a structured framework for synthetic data generation using DES for manufacturing problems.

Proposed framework
The proposed synthetic data generation framework is described step by step as follows:
(1) Define the layout and demand behaviour of the manufacturing plant.
(2) According to the defined manufacturing layout, use any DES software to construct the DES model.
• In our test studies, the software ARENA was used.
(3) Use the constructed simulation model to generate manufacturing process data.
• The dimension of the data is dependent upon the complexity of the manufacturing layout. For example, in the first test study (simplest manufacturing layout modelled), nine dimensions of data are written, while in the third test study (most complex manufacturing layout), 77 dimensions of data are written.
• Examples of process data include:
○ The demand of each SKU within the defined specific time frame
○ The utilisation of each facility and number of each SKU produced by each facility
○ Time (value-added time, waiting time, non-value-added time, other time).
(4) Record the time required to generate data.
• In this study, the simulation time is available and can be retrieved from the metadata of the data files generated by ARENA.
• This time information is necessary for planning the simulation runs and estimating the time requirements.
(5) Plan the execution of experiments to obtain the full dataset.
• The full dataset will consist of repetitions of experiments using combinations of parameter ranges. Based on the data generation time obtained from the initial trials, plan the experiment runs. Depending on the computing resources available, the experiments can be run in sequence, or in parallel when multiple physical or cloud systems are available.
(6) According to the plan, run the simulation experiments and save the complete dataset in files with all data points recorded.
(7) Check, clean, combine and store the full dataset.
• At this point, the synthetic data generation process is considered complete. The dataset is now ready for the next phase of training machine learning algorithms, or big data analysis and/or visualisation.
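Step (5) of the framework can be illustrated with a minimal Python sketch. It is not part of the study's tooling: the function name, the parameter names (`demand_max`, `assembly_machines`) and their values are hypothetical, and round-robin assignment is only one simple way to spread runs over several machines.

```python
import itertools


def plan_runs(param_ranges, repetitions, n_workers):
    """Enumerate all parameter combinations, repeat each one,
    and deal the resulting runs out to the available workers."""
    names = sorted(param_ranges)
    combos = [
        dict(zip(names, values))
        for values in itertools.product(*(param_ranges[n] for n in names))
    ]
    runs = [dict(combo, repetition=r) for combo in combos for r in range(repetitions)]
    # Round-robin: worker i receives runs i, i + n_workers, i + 2 * n_workers, ...
    schedule = [runs[i::n_workers] for i in range(n_workers)]
    return runs, schedule


# Hypothetical parameter ranges, for illustration only.
param_ranges = {
    "demand_max": [500, 1000],
    "assembly_machines": [5, 10, 15],
}
runs, schedule = plan_runs(param_ranges, repetitions=2, n_workers=4)
print(len(runs), [len(s) for s in schedule])
```

With `n_workers=1` the same plan describes a purely sequential execution; the per-run time recorded in step (4) can then be used to estimate how long each worker's share will take.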

Test studies
When simulating a dynamic system, a flowchart is often used to deconstruct the sequence of operations, particularly to define the elements which are within and beyond the scope of a simulation (Allen, 2011; Van der Zee & Van der Vorst, 2007). The scope of this research includes three DES models: a sequential process, a parallel process, and flexible manufacturing, as shown in Figures 1 and 2. These models share a commonality of mixing deterministic and stochastic modelling. Some input values are manually assigned, and some are randomly generated. One of the most important randomly generated inputs is demand, which imitates the daily variation (uncertainty) of consumption levels. Assuming that all demand values have the same probability of occurrence, a uniform distribution is selected for demand. The distribution parameters of the first, second, and third layouts are shown in Table 1. In addition, the demand occurrence time follows an exponential distribution. Finally, a replication length of 24 hours and a warm-up period of zero are set.
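The demand input described above can be sketched in Python as follows. This is an illustrative reading rather than the ARENA implementation: the demand bounds are hypothetical stand-ins for the Table 1 parameters, and the way the exponential inter-arrival rate is tied to the daily demand is an assumption made here so that the arrivals roughly span the 24-hour replication.

```python
import random


def generate_demand_events(rng, demand_low, demand_high, horizon_hours=24.0):
    """Draw one day's demand and the times at which each demand occurs.

    Assumptions (not from the paper): demand_low >= 1, and the arrival
    rate is daily_demand / horizon_hours.
    """
    # Uniform demand: every integer in [demand_low, demand_high] is equally likely.
    daily_demand = rng.randint(demand_low, demand_high)
    rate_per_hour = daily_demand / horizon_hours
    t = 0.0
    arrival_times = []
    for _ in range(daily_demand):
        # Exponentially distributed gap between consecutive demand occurrences.
        t += rng.expovariate(rate_per_hour)
        arrival_times.append(t)
    return daily_demand, arrival_times


rng = random.Random(0)
demand, arrivals = generate_demand_events(rng, demand_low=100, demand_high=200)
print(demand, round(arrivals[-1], 2))
```

Because the gaps are random, the last arrival may fall slightly before or after the 24-hour mark; a full model would need a rule for demand that arrives after the replication ends.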

DES model 1 -description
The first model represents a simple sequential manufacturing line with three processes: drilling, milling, and assembly. The layout imitates a simple production line where blanks arrive at the first station and require drilling operations. The drilling station is capable of processing up to fifteen blanks simultaneously. The processing time follows a triangular distribution with parameters given in Table 2. If there is no available machine (resource) online, the part will wait in a queue, which is an infinite size buffer for simplification. A similar approach is taken in the second and third stations. The drilled part moves to the next station where milling is performed. When finished, it moves on to the assembly process, where additional standard parts, which are outside the scope of the model, are used. It should be noted that this simplified model assumes that the movement of parts between stations requires neither travel time nor changeover time, the service life of the machines is unlimited, and all queues are infinite. As shown in Table 2, each station follows the same distribution with the same parameters. The differences in utilisation between stations were achieved by assigning different numbers of available machines. For example, the assembly station was intentionally bottlenecked by assigning the lowest number of available machines. For each DES model, the data features and time requirements for generating the synthetic dataset will be described. For consistency, all simulation experiments were conducted on a Microsoft Windows PC with an Intel CPU (4 cores, 3.8 GHz) and 16 gigabytes of random access memory (RAM).
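A sequential line of this kind can be approximated in a few lines of Python. The sketch below is not the ARENA model used in the study: the triangular parameters, the milling and assembly machine counts, and the arrival process are hypothetical placeholders for the values in Tables 1 and 2, chosen only so that milling has the most machines and assembly the fewest. Each station is treated as a first-come-first-served multi-server queue with an infinite buffer and zero transfer time.

```python
import heapq
import random

HORIZON = 24 * 60.0  # replication length in minutes (24 hours, no warm-up)


def simulate_station(arrivals, n_machines, tri, rng, horizon=HORIZON):
    """FIFO station with n_machines identical machines and an infinite queue."""
    low, mode, high = tri
    free_at = [0.0] * n_machines  # min-heap of times at which machines become free
    heapq.heapify(free_at)
    departures, total_wait, busy = [], 0.0, 0.0
    for t in sorted(arrivals):
        start = max(t, heapq.heappop(free_at))  # wait for the earliest free machine
        finish = start + rng.triangular(low, high, mode)
        heapq.heappush(free_at, finish)
        departures.append(finish)
        total_wait += start - t
        # Count only the busy time that falls inside the replication horizon.
        busy += min(finish, horizon) - min(start, horizon)
    mean_wait = total_wait / len(departures) if departures else 0.0
    return departures, mean_wait, busy / (n_machines * horizon)


def simulate_line(n_parts, rng):
    """One replication of a drilling -> milling -> assembly line."""
    stations = [("drilling", 15), ("milling", 20), ("assembly", 10)]  # hypothetical counts
    tri = (4.0, 6.0, 9.0)  # hypothetical (min, mode, max) processing time in minutes
    t, flow = 0.0, []
    for _ in range(n_parts):
        t += rng.expovariate(n_parts / HORIZON)  # assumed arrival process
        flow.append(t)
    features = {"demand": n_parts}
    for name, machines in stations:
        flow, wait, util = simulate_station(flow, machines, tri, rng)
        features[f"wait_{name}"] = wait
        features[f"util_{name}"] = util
    features["completed_in_horizon"] = sum(1 for d in flow if d <= HORIZON)
    return features


features = simulate_line(2000, random.Random(42))
print({k: round(v, 3) for k, v in features.items()})
```

Each call to `simulate_line` yields one row of a dataset; repeating it with different demand values and seeds plays the role of replications.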

DES model 1 -data features
The synthetic dataset of model 1 consists of nine features of data, collected during simulation runs:
• Demand as number of parts counted at the end of replication - 1 feature (part counter).
• Average number of parts counted per hour throughout the replication - 1 feature (part counter).
• Average of total value-added time throughout the replication - 1 feature (time).
• Waiting time per process (drilling, milling, assembly) - 3 features (time).
• Utilisation per process (drilling, milling, assembly) - 3 features (utilisation).
Demand is assigned on a random basis and assumed to be fulfilled by daily production (1 replication or 24 hours). The average number of parts per hour is written as a validation tool for the model, as this feature will be monitored during a simulation run. The average total value-added time shows how long a part will be processed at all the stations combined. The value-added time can be written for each station or process separately. However, since the process time is set to be identical for all three stations, and the layout is a series system, the value-added time is instead written as a total time. Waiting time is the average time a part waits in a queue before it is further processed. Since all stations differ in resource quantities, waiting time is written separately for each station. Utilisation is the proportion of resources occupied, averaged throughout the replication, to show how busy a process is. The results of the data generation stage can be shown as distributions of the generated synthetic data. The total number of products produced per day (in this case equal to demand) and the utilisation of each station are shown in Figure 3. The capacity of the milling station is the highest, so its utilisation is the lowest, whereas the capacity of the assembly station is the lowest, so its utilisation is the highest (near 100%). The number of parts produced per hour tends to follow a uniform distribution, so the same is true for utilisation. These results showed that the simulation model is valid and that the generated data covers most possible production scenarios, which is valuable for training machine learning models. These findings are important as they demonstrate the ability of DES to generate synthetic data for training machine learning algorithms.

DES model 1 -data generation time requirement
The next step in the analysis is to investigate the time required for generating data for the DES model. The simulation time can be obtained from the metadata of the generated synthetic data. In each experimental run, synthetic data is generated from 60,000 replications. The time to generate data is aggregated for every 1000 replications, so there are 60 data points in a simulation run. The time required for generating data for each model is shown in Figure 4. For model 1, in each experimental run, it was observed that the time required to generate each additional 1000 replications increased by 2.716 seconds. This phenomenon is suspected to be caused by the simulation accumulating computing resources as it runs, which reduces the available memory. The time required to generate data is dependent on the process sequences and the complexity of the manufacturing layout. This will be discussed further in the time requirement sections of models two and three. A summary of the time parameters is provided in Table 3.
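The gradient reported above can be estimated from the per-block timings with an ordinary least-squares line. The following Python sketch shows one way to do this; the base time of 20 seconds is a hypothetical value used only to build example data, while the gradient of 2.716 seconds per block is the figure quoted in the text for model 1.

```python
def fit_line(block_times):
    """Least-squares fit of block time against block index (0, 1, 2, ...).

    Returns (intercept, slope); the slope is the average increase in
    generation time from one block of 1000 replications to the next.
    """
    n = len(block_times)
    if n < 2:
        raise ValueError("at least two blocks are needed to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(block_times) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(block_times))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Example data: 60 blocks with a hypothetical 20 s base time and the
# 2.716 s gradient quoted for model 1.
block_times = [20.0 + 2.716 * k for k in range(60)]
intercept, slope = fit_line(block_times)
print(round(intercept, 3), round(slope, 3))
```

In practice the input would be the 60 timing values read from the file metadata of one run, and the fitted slope would be compared across models.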

DES model 2 -description
The second layout contains the same three stations but is organised differently. Both drilling and milling work as independent parallel processes. The stations receive their own blanks, Part 1 and Part 2 respectively. Similar to the first test study, the processing time follows a triangular distribution. When there are not enough machines to process all parts, they wait in a queue before the corresponding station. Once the parts are processed, they are sent to two corresponding infinite size buffers and held until the next station is ready and available. Both parts are picked up and moved to the assembly station by a picker. Some assumptions about the work of pickers are:
• Both drilled and milled parts must be available to be picked up, otherwise, the picker will wait.
• At least one machine must be free at the assembly station, otherwise, the picker will wait.
• No time is required to pick up and deliver parts.
• It takes 10 seconds for the picker to return to the starting point.
The assembly process also follows a triangular distribution. However, in contrast to the previous layout, parts do not queue before the assembly station but are held in the warehouse. Hence, it is possible to track the statistics of each part, such as value-added time. Detailed information about processing times and resources is consolidated in Table 4. The expanded flowchart of the layout can be found in the Mendeley Data repository.
The second model layout is similar to the first model but with a different workflow. This is to allow a comparison between the models, and particularly to investigate how each performs differently.

DES model 2 -data features
The synthetic dataset of model 2 consists of 16 features of data, collected during simulation runs:
• Total parts produced in each station - 3 features (part counter).
• Average inventory level or total parts stored in drilling and milling warehouses - 2 features (part counter).
• Average value-added time for each station - 3 features (time).
• Average waiting time for each station - 3 features (time).
• Average time a part spent in drilling and milling warehouses only - 2 features (time).
• Average utilisation for each station - 3 features (utilisation).
Most of the data features are defined in the same way as in the first model. The new features are the part counters, which are used to record the average inventory levels of the drilling and milling warehouses, and the time features, which are used to record the average time that a part is stored in the drilling and milling warehouses. The total number of parts produced per day and the utilisation of each station are shown in Figure 5. Since the demand parameters are equal, the two distributions of parts produced per day are similar. However, the distribution of finished products produced per day is expected to be different from the distributions of parts. The slight difference is due to the fact that some unfinished parts are stored in the warehouse at the end of each replication; however, the measures of central tendency are similar. The utilisation results are also in line with expectations. Similar to model 1, the milling process exhibited the lowest utilisation due to its highest capacity. The simulation itself was valid, and no outliers, missing values, or any other inconsistencies were observed.

DES model 2 -data generation time requirement
Due to higher complexity, the data generation time for model 2 is expected to be longer. To ensure that data can be obtained within a reasonable time frame, the number of replications per run has been reduced to 30,000. The time to generate data is aggregated for every 1000 replications, so there are 30 data points in a simulation run. The orange line in Figure 4 shows the data generation time of model 2. Similar to model 1, the time required to generate 1000 additional replications is not constant but increases by 3.005 seconds on average. The similarity can be seen through the two almost parallel lines shown in Figure 4. Table 5 summarises the time parameters of model 2.

DES model 3 -description
The third model is the most complex and is adopted from Slack et al. (2010). Changes have been made to the model to better suit this research purpose. The manufacturing site produces four types of SKU, each with its own parameters including demand, weight, type, and processing time by station. The original production site contains both sequential and parallel processes, similar to the cases discussed above. However, in addition to the original scope, a flexible manufacturing facility has been introduced and included in the model. According to Das and Nagendra (1997), a flexible manufacturing facility can adapt to changes, both in its internal and external environment, quickly and economically; and particularly, to provide routing flexibility, which is the ability to manufacture a product via several alternate routes in the same facility. The flexible layout is set to become accepted by an increasing number of SME manufacturers amid growing demand for customizable products as a part of the e-manufacturing development stream, which also implies digitalization, mobility, and immediacy (Cheng & Bateman, 2008). Refer to Figure 2 for layout and RHC19 for detailed process flowcharts.
Table 5 (partial). 198.333 seconds (row label missing). Gradient, average increase of time every 1000 data in the run: 3.005 seconds. Gradient, average increase of time every 1000 data (first 12 sequences): 3.608 seconds.
In this model, a forklift is used to move parts in batches from station to station. The forklift will wait until the total weight reaches 2,000 kilograms, which is the maximum load of the forklift. Each part is assigned a specific constant weight, corresponding to a specific SKU type. The SKU parameters are shown in Table 6. With the introduction of a forklift, according to the requirements of ARENA, the distance between stations needs to be added in the DES model. Moreover, travel between stations does take some stochastic time within a predefined range. In addition to the introduced travel time, the loading and unloading time at each station need to be modelled as a constant. If parts are queuing during the processing interval, it is assumed that there is an infinite buffer in front of the workstations.
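The weight-based batching rule for the forklift can be sketched in Python as below. The dispatch rule is an interpretation, not taken from the ARENA model: here the forklift leaves when the load reaches exactly 2,000 kg, or when the next part would push it over that limit. The part weights in the example are hypothetical and do not come from Table 6.

```python
MAX_LOAD_KG = 2000  # maximum forklift load stated in the text


def batch_for_forklift(part_weights, max_load=MAX_LOAD_KG):
    """Group parts, in arrival order, into forklift loads.

    Returns (dispatched_batches, waiting_parts). Parts left in
    waiting_parts are still at the station because the load is not full.
    """
    batches, current, load = [], [], 0
    for weight in part_weights:
        if weight > max_load:
            raise ValueError("a single part exceeds the forklift capacity")
        if load + weight > max_load:
            # The next part does not fit: send the current load first.
            batches.append(current)
            current, load = [], 0
        current.append(weight)
        load += weight
        if load == max_load:
            # The load is exactly full: dispatch immediately.
            batches.append(current)
            current, load = [], 0
    return batches, current


# Hypothetical part weights in kilograms, in order of arrival.
weights = [500, 500, 500, 500, 800, 800, 800, 250]
batches, waiting = batch_for_forklift(weights)
print(batches)  # [[500, 500, 500, 500], [800, 800]]
print(waiting)  # [800, 250]
```

A full model would also attach the stochastic travel time and the constant loading and unloading times to each dispatched batch.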
The DES model starts when the raw material (coil) arrives at the blanking station, where it is processed into one of the four SKUs. Processing time includes setting time and blanking time. According to the SKU type, the blanks are then loaded and sent in batches by the forklift to the corresponding pressing stations, which work as independent parallel processes. The post-pressing parts are then transferred to the assembly station in batches by the forklift. There are four flexible manufacturing cells in the assembly station. Each cell is capable of processing specific SKU types with different processing times and variances, and is equipped with an infinite input buffer, modelled as a warehouse with unlimited capacity, to hold incoming parts. The same transfer logic of using the forklift applies to moving the assembled parts to the last station for painting and quality checking. The paint and quality station mimics two conveyors, each capable of handling 3,600 parts at once, and a simple quality check station (Harun & Cheng, 2012). Processes including primer coating, painting, and furnace drying are performed in the cell, requiring a total process time of 90 minutes. Quality control is the last process, with processing time modelled as a triangular distribution. Immediately after the quality check, the SKU is considered as produced, and consolidated information on inter-arrival times, processing times, capacities, and other variables is recorded in the dataset.

DES model 3 -data features
Model 3 is the most complex, with series and parallel layouts and flexible workflows. The number of data features has increased significantly, and so has the time required to generate data. There are five groups of data features, and a total of 77 features, as shown in Table 7. The complexity of model 3 resembles a real production layout. A massive amount of data, a total of 730,497 rows, was generated in this test study. In order to gain insight into the production, one of the runs containing 3000 replications was randomly selected for data visualisation. The distribution of total daily production, and the distribution of daily production by SKU, are shown in Figure 6. The total product assembled by SKU in each cell per day, and the distribution of assembly cell utilisation, are shown in Figure 7. These figures demonstrate that DES is capable of generating a massive amount of well-distributed synthetic data, covering a large number of production scenarios of a complex manufacturing layout.
this phenomenon is suspected to be related to the gradual depletion of memory and other hardware or software related issues. A summary of time requirement parameters for data generation is provided in Table 8.

Discussion
The proposed framework has proven to be robust in synthetic data generation. Generally speaking, a more complex manufacturing system requires more data features to fully describe its system behaviour. Therefore, the number of data features can be regarded as a proxy for system complexity. Figure 8(a) shows that as the number of features (or model complexity) increases, the time required to generate data also increases. Due to the high complexity of the third model, the time required to generate synthetic data has increased by nearly 800% and 570% compared to the first and second layouts, respectively. The increase is significant, but not unexpected. As mentioned before, the data generation speed tends to slow down gradually during a run, most likely due to memory and other hardware/software issues. Figure 8(b) shows that this extra time requirement, for generating 1000 additional data points in the same run, also increases when the number of features (or model complexity) increases. Since the data generation time is specific to the model and computing environment, trial and error is necessary, especially in the early stages, to gain some insights for planning the entire data generation process. Another consideration is to find a practical balance between the number of runs and the number of replications per run to obtain the right amount of data.
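The balance between the number of runs and the number of replications per run can be explored with a simple arithmetic model. The Python sketch below assumes, as a simplification not verified in the study, that block time grows linearly within a run and that the slowdown resets when a new run starts; `first_block_s` and `setup_s` are hypothetical parameters, while 2.716 seconds is the model 1 gradient quoted earlier.

```python
def run_time(n_blocks, first_block_s, gradient_s):
    """Time for one run of n_blocks blocks when block k takes
    first_block_s + gradient_s * k seconds (k = 0, 1, ..., n_blocks - 1)."""
    return n_blocks * first_block_s + gradient_s * n_blocks * (n_blocks - 1) / 2


def total_time(total_blocks, n_runs, first_block_s, gradient_s, setup_s=0.0):
    """Total time when total_blocks are split evenly over n_runs runs.

    setup_s is a hypothetical fixed overhead paid once per run.
    """
    if total_blocks % n_runs != 0:
        raise ValueError("total_blocks must be divisible by n_runs")
    per_run = total_blocks // n_runs
    return n_runs * (setup_s + run_time(per_run, first_block_s, gradient_s))


# Time attributable to the slowdown alone (first_block_s = 0) for
# 60 blocks of 1000 replications with the model 1 gradient.
for n_runs in (1, 2, 3, 6):
    print(n_runs, round(total_time(60, n_runs, 0.0, 2.716), 2))
```

Under these assumptions, one run of 60 blocks spends about 4807 seconds on the slowdown alone, while two runs of 30 blocks spend about 2363 seconds in total, so shorter runs are cheaper until the per-run setup overhead dominates.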

Limitations and future research
It has been demonstrated that using DES to generate synthetic data is a fast and reliable approach. This method is a viable alternative when real manufacturing data is not available or difficult to collect. An important limitation is that DES can be used to generate data for a steady-state production; while extreme/rare events, such as extended machine breakdowns or major supply chain interruptions, are not considered. In order to cover both steady-state production scenarios and extreme/rare events, different datasets with different data features may be required. It is also possible to add more features in the dataset to achieve wider coverage and solve more problems. For example, adding cost-related data features to the datasets will allow for investigation of production cost optimisation problems.
There are five research directions identified for future work:
(1) Synthetic data generation is useful for many different types of manufacturing problems, from analysis, planning, optimisation, to decision making. Conducting a meta-analysis of the classification and requirements of data features in different manufacturing problems will help confirm the applicability of DES to a much wider range of manufacturing problems, and provide valuable insights for further improvement of the proposed synthetic data generation framework.
(2) Investigating the resulting ML model performance by using real data and synthetic data. Real data collected naturally contains noise due to the environment and many other external and internal factors. Therefore, real data is never perfect and often suffers from the corruption that may hinder the performance of the ML algorithm (El Emam et al., 2020; Wu & Zhu, 2008). Moreover, there is a relationship between the real-world data and DES. When real data is used to construct the DES model and if the real data is corrupted, then the DES model will also be corrupted.
To improve the ML model accuracy, data cleaning can be applied to identify the incorrect, incomplete, inaccurate, or missing data and then modify, replace or delete accordingly. However, DES can produce synthetic data by reverse-engineering the real data to model its statistical properties and distributions. The problem with synthetic data is that generating good synthetic data is hard (Strickland, 2022). To extend the current research, designing and conducting experiments to analyse and compare the robustness, accuracy, and effectiveness of the ML models learned with cleaned real data, and synthetic data with different levels of controlled noise will help understand how to generate better synthetic data for manufacturing problems.
(3) Conducting a conceptual study on how real data and synthetic data, both as historical and real-time data, can be augmented by domain experts, from the perspectives of data, information, and knowledge. Designing and building frameworks that effectively integrate and enable human-machine collaboration in solving manufacturing problems.
(4) Conducting an experimental study in integrating real and synthetic data using existing tools to solve real-time manufacturing problems. Existing tools such as MTConnect and Apache NiFi are promising. MTConnect provides a standard solution to collect data from production machines and devices; while Apache NiFi supports automation of data flow between software systems (Cui et al., 2019). In addition to structured data, the experiments can also be extended to include unstructured data types that arrive into the systems as data-at-rest or data-in-motion (Cui et al., 2020b).
(5) DES and synthetic data generation should be considered as an integral part of the big data ecosystem (Cui et al., 2020a). The software tools (e.g. AI, ML, analytics, visualisation, and workflow) that currently exist in the ecosystems are powerful enablers for innovative manufacturing applications and solutions that bridge the virtual and real worlds. Applying these tools, together with DES and synthetic data, to develop creative applications is an exciting area of future research.
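As an illustration of research direction (2), the Python sketch below shows one possible way to add a controlled level of noise to a synthetic feature column. It is an assumption about how such an experiment could be set up, not a method from this study: zero-mean Gaussian noise is used, scaled by a hypothetical `noise_level` expressed as a fraction of the column's standard deviation, and the example values are invented.

```python
import random
import statistics


def add_noise(values, noise_level, rng):
    """Return a copy of values with zero-mean Gaussian noise added.

    noise_level is the noise standard deviation expressed as a fraction
    of the standard deviation of the original values (0 means no noise).
    """
    if noise_level < 0:
        raise ValueError("noise_level must be non-negative")
    sigma = noise_level * statistics.pstdev(values)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Hypothetical utilisation values from a synthetic dataset.
utilisation = [0.62, 0.71, 0.55, 0.90, 0.83, 0.47]
rng = random.Random(7)
datasets = {level: add_noise(utilisation, level, rng) for level in (0.0, 0.1, 0.5)}
print({level: [round(v, 3) for v in vals] for level, vals in datasets.items()})
```

Training the same ML model on each noise level and comparing accuracy would give a first indication of how sensitive the model is to data quality; bounded features such as utilisation may additionally need clipping to their valid range.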

Disclosure statement
No potential conflict of interest was reported by the author(s).