Preventive maintenance scheduling: a simulation-optimization approach

ABSTRACT This paper presents a framework for preventive maintenance (PM) scheduling in the semiconductor industry. We propose an approach for finding the start time of each PM within its PM window so as to minimize production losses due to maintenance activities. We consider a re-entrant process, in which wafers enter the same equipment location several times, but at different stages and sometimes for different processes. Because of the complexity of the optimization problem, we develop meta-heuristics, namely a genetic algorithm and particle swarm optimization, to solve it, and compare them with resource leveling and with the baseline. Discrete event simulation is embedded in each algorithm to mimic the wafer fab process and evaluate its performance. The proposed approach is able to identify the best arrangement of PM start times within the PM windows, and provides a way to optimize PM schedules for a complex system by simultaneously utilizing meta-heuristics and discrete event simulation.


Introduction
The semiconductor industry has grown rapidly to become a significant driver of economic growth. Many companies in this area try to stay competitive by improving their system performance in terms of throughput rate, work-in-process inventory, cycle time, and overall equipment efficiency (Montoya-Torres, 2006). In practice, industries are often interested in maximizing throughput rate, and hence profits, by maximizing equipment availability. However, this problem is not straightforward to solve. Equipment must be maintained at periodic intervals, usually within a PM window, and equipment cost is a large proportion of manufacturing cost in the semiconductor industry. Optimizing preventive maintenance schedules is therefore an important issue, and various conditions must be considered to schedule equipment effectively.
There are two main types of maintenance activities: preventive maintenance, whose activities can be planned, and corrective maintenance, which is related to unforeseeable breakdowns (De Jonge & Scarf, 2020; Marmier et al., 2009). Many companies have diverted resources towards maintenance management because of the importance and complexity of maintenance planning (Assid et al., 2015; Cheng et al., 2018; De Jonge & Scarf, 2020; Li et al., 2009). In terms of equipment quality, Panagiotidou and Tagaras (2008) divide maintenance into two types. The first is minimal maintenance, which upgrades the equipment's quality state without affecting its age. The second is perfect preventive maintenance, which fully restores the equipment to an as-good-as-new condition. Sheu and Wu (1999) report that two-thirds to three-quarters of manufacturing costs are equipment costs. Therefore, preventive maintenance scheduling is an essential topic in semiconductor manufacturing systems. PM is a reasonable way to reduce the probability of machine failures. However, PM actions increase production cost and consume the system's time capacity (Franciosi et al., 2020; Renna, 2012; Safaei et al., 2010). Previous studies have proposed methods for the preventive maintenance scheduling problem (PMSP). However, most of them used relatively simple cases in terms of the number of tools involved, the number of PM tasks, the assumptions about PM duration, the scheduling horizon, and the way the system is modelled (Mosley et al., 1998; Charles et al., 2003; Suryadi & Papageorgiou, 2004; Yao et al., 2007; Cassady & Kutanoglu, 2005; Crespo Marquez et al., 2006; Moghaddam & Usher, 2011; Xiao et al., 2016; Chang, 2018). When process planning, scheduling and PM decisions are handled together, minimizing the total completion time of a set of jobs becomes an important issue (Gurel & Akturk, 2008; Liao et al., 2017; Xiao et al., 2016).
There are many factors that should be considered simultaneously in PM scheduling. From the system point of view, system output is a function of several factors, such as throughput, WIP, cycle time and equipment utilization. In addition, realistic considerations should be incorporated when modelling the system, so that the results make sense and are applicable. Since the problem is difficult to solve with any analytical method, we combine heuristic optimization and discrete event simulation. Contributions of this study include:
(1) Identifying the best arrangement for PM's start time within a PM window.
(2) Providing a framework for solving real complex systems by simultaneously utilizing meta-heuristics and discrete event simulation for optimization.

Problem statement
Semiconductor manufacturing systems are highly complex systems in which many conditions and inter-dependent activities should be considered (Ghahramani et al., 2020; Qiao et al., 2020; Wang et al., 2020). Wafer fabrication involves hundreds of process steps, and each location in a wafer fab consists of several pieces of equipment. Since equipment degrades with production run time, preventive maintenance should be performed at periodic intervals to prevent shut-downs. Semiconductor manufacturing equipment generally has scheduled PM tasks within specific PM windows. As long as the PM time is within an allowable window, there is flexibility in the timing of the PMs. The best start time for each PM should be determined so that production losses due to maintenance downtime are minimized, yielding an optimum throughput rate. Different PM tasks may use different measures to identify which chamber is to be maintained, such as time units (e.g. hour, day, week, and month), power consumption, wafer thickness and number of wafers. For example, chemical vapor deposition (CVD) machines use the number of wafers produced together with time units to quantify the need for PM (Sheu & Kuo, 2006).
In a wafer fab, several products are produced by many machines and in many stages. The process is made more complicated by re-entrant flow, in which wafers enter the same equipment location several times, but at different stages and sometimes for different processes. This situation makes any analytical method extremely difficult to apply. Instead, meta-heuristics such as genetic algorithms and particle swarm optimization generate candidate schemes for the start times of PM tasks, called input factors, and discrete event simulation evaluates the output of those input factors. The approach can thereby identify a near-optimum scheme for the start times of PM tasks.

Scenario description
This study focuses on preventive maintenance scheduling at a chemical vapor deposition (CVD) location. At this location, a chemical process is performed to produce high-purity, high-performance materials, and to deposit specific substances onto wafers to obtain the desired wafer characteristics.
The materials/substrates are typically transferred by a gas source or a liquid source, see Figure 1. Because of the way tools are used in CVD processes, parts of a tool such as chambers, furnace tubes and bell jars need to be maintained and cleaned periodically.
In this study, we consider 12 wafer product types with a varying product mix. We also take into account arrival schedules and lot sizes. Each product moves from one facility (a group of tools) to another in a product-specific sequence. There are several machines in a facility, and each machine consists of 1-3 chambers. A wafer may enter the same facility several times but at different stages (re-entrant process). We consider 15 different PM tasks for each machine, such as lamp exchange, heater exchange and valve exchange, each with its own PM window and duration. Since the problem is difficult to solve analytically, we use simulation-optimization to mimic the wafer fab process and solve the PM scheduling problem. Figure 2 shows the flowchart of the research methodology used in this study for preventive maintenance optimization.

Methodology
Symbol definitions
$i$: Index for tool, $i = 1, 2, \ldots, I$.
$j$: Index for chamber in a tool, $j = 1, 2, \ldots, J_i$, where $J_i$ is the total number of chambers in tool $i$.
$s$: Index for the PM tasks, $s = 1, 2, \ldots, S_{ij}$, where $S_{ij}$ is the maximum number of PM tasks at chamber $j$ of tool $i$.
$t$: Index for time, $t = 1, 2, \ldots, T$, where $T$ is the planning horizon.
$ST_{ijs}$: Start time for PM task $s$ at chamber $j$ of tool $i$.
$PM_{ijs}(t)$: Binary decision variable regarding PM task $s$ at chamber $j$ of tool $i$ at time $t$; it equals 1 when the chamber is preventively maintained and 0 otherwise.
Parameters
$L_{ijs}$: Lower bound of the PM window for PM task $s$ at chamber $j$ of tool $i$.
$U_{ijs}$: Upper bound of the PM window for PM task $s$ at chamber $j$ of tool $i$.
$P$: Population size.
$C$: Crossover rate.
$M$: Mutation rate.
$d_{ijs}$: Duration of PM task $s$ at chamber $j$ of tool $i$.
$Ch_i$: The maximum number of chambers that can be maintained at the same time in tool $i$.
Objective function
In PM scheduling, two objective functions that are common in the semiconductor industry are tested to evaluate the proposed approaches: maximizing throughput and minimizing manpower. Please note that we perform simulation-optimization separately for these two objective functions and compare the results.

(a) Maximize throughput rate
The goal is to maximize the throughput rate within the planning horizon in order to increase the number of wafers produced.

(b) Minimize manpower
The goal is to minimize the level of resources (manpower) within the planning horizon.
We use discrete-event simulation to obtain the throughput rate and the level of manpower. Although the objective function is based on one of these two indicators, we also collect information about cycle time, work in process (WIP) and equipment utilization at the end of the simulation.
Constraints
Three constraints are identified as follows:
(1) The start time of a PM task must lie within the PM window for the corresponding chamber of the corresponding tool.
(2) A PM task is not interrupted once it has started, and only one PM is performed at a time.
(3) The number of chambers maintained at the same time in a tool cannot exceed the maximum $Ch_i$.
Table 1 presents the problem addressed in this study. In semiconductor manufacturing, e.g. in Taiwan, different measures are used to identify the need for maintenance services, such as number of wafers produced (pieces; PCS), units of time (hours; H), thickness, power consumption, or another specification unit (Spec). Since each PM task has a specific PM window, maintenance must start within this window.
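The three constraints can be checked mechanically before any simulation is run. The following is a minimal sketch, not the authors' implementation: the `PMTask` record, the `is_feasible` function and the half-open interval convention [ST, ST + d) are illustrative assumptions, and constraint (2) is read here as "PM tasks on the same chamber may not overlap".

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PMTask:
    """Hypothetical record for one PM task (names are illustrative)."""
    tool: int        # i
    chamber: int     # j
    task: int        # s
    lower: float     # L_ijs, lower bound of the PM window
    upper: float     # U_ijs, upper bound of the PM window
    duration: float  # d_ijs


def is_feasible(tasks, starts, max_chambers):
    """Return True if starts[k] (ST for tasks[k]) satisfies all three constraints.

    max_chambers maps a tool index to Ch_i.
    """
    # (1) Every start time lies inside its PM window.
    for task, start in zip(tasks, starts):
        if not (task.lower <= start <= task.upper):
            return False

    # Each PM occupies the uninterrupted interval [start, start + duration).
    intervals = [(task, start, start + task.duration)
                 for task, start in zip(tasks, starts)]

    # (2) Assumed reading: two PM tasks on the same chamber never overlap.
    for (a, a0, a1), (b, b0, b1) in combinations(intervals, 2):
        same_chamber = (a.tool, a.chamber) == (b.tool, b.chamber)
        if same_chamber and a0 < b1 and b0 < a1:
            return False

    # (3) At most Ch_i chambers of a tool are down at the same time.
    # Concurrency can only rise at a start time, so checking starts suffices.
    for task, s0, _ in intervals:
        busy = {other.chamber for other, o0, o1 in intervals
                if other.tool == task.tool and o0 <= s0 < o1}
        if len(busy) > max_chambers[task.tool]:
            return False
    return True
```

For example, two four-hour PMs on different chambers of a tool with $Ch_i = 1$ are feasible when they start at 0 and 4, but not when they start at 0 and 2.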

Design of genetic algorithm
This section explains the design of a genetic algorithm (GA) and the procedure for optimizing preventive maintenance schedules. GA is a well-known metaheuristic method for optimization problems with large search spaces. In this study, Visual C++ and Flexsim 4.3, a discrete event simulator, are used together in a genetic algorithm procedure, as shown in Figure 3. Solutions to the PM problem are modeled as chromosomes, where each gene of a chromosome represents the start time of a PM, see Figure 4. For the initial population of the genetic algorithm, we generate genes randomly. To ensure that a solution is feasible, we enforce the constraints: the start time of each PM must lie within its PM window ($L_{ijs} \le ST_{ijs} \le U_{ijs}$), only one chamber can be maintained at a time, and only one PM can be done at a time. A redo action is performed whenever an infeasible solution is generated during the initial population step.
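The encoding and the redo step can be illustrated with a short sketch. It assumes integer start times, represents each PM window as a `(lower, upper)` pair, and takes the feasibility check as a caller-supplied function; the names `random_chromosome` and `initial_population` and the `max_tries` safeguard are hypothetical and not taken from the paper.

```python
import random


def random_chromosome(windows, rng):
    """One gene per PM task: a start time drawn uniformly inside [L, U]."""
    return [rng.randint(lower, upper) for lower, upper in windows]


def initial_population(windows, size, is_feasible, rng, max_tries=10_000):
    """Build `size` feasible chromosomes, redrawing any infeasible candidate."""
    population = []
    tries = 0
    while len(population) < size:
        tries += 1
        if tries > max_tries:  # safeguard added for this sketch
            raise RuntimeError("could not build a feasible initial population")
        candidate = random_chromosome(windows, rng)
        if is_feasible(candidate):  # the "redo" action: discard and redraw
            population.append(candidate)
    return population
```

Because every gene is drawn inside its own window, the window constraint holds by construction and only the remaining constraints can trigger a redo.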
Since genetic algorithms are sensitive to parameters such as population size (P), crossover rate (C) and mutation rate (M), experiments are conducted to find near-optimum values for these parameters, as in Pongcharoen et al. (2002) and Yang et al. (2007). We use a practical problem with a one-month planning horizon, as described in section 4, for the experimental design. Table 2 presents the regression results for a $3^k$ factorial design with three replications of a one-month planning horizon. Table 2 shows that the constant, P, C and $C^2$ are significant factors, since the p-values for these predictors are less than 0.05. The coefficients of the constant, P, C and $C^2$ are 4177, 1.43, 112.3 and $-78.2$, with standard errors of 15.09, 0.48, 31.82 and 21.74, respectively.
The throughput output is modelled with the following formula:
$$y = 4177 + 1.43P + 112.3C - 78.2C^2 \qquad (4)$$
According to equation 4, a larger population significantly increases the throughput rate. Since a high throughput rate is preferred and the highest level of population size among the experimental factors is 60, we use a population size of 60 for further study. Taking the first derivative of equation 4 with respect to the crossover rate, and restricting it to the range of crossover levels used in the experiment, gives a best crossover rate of 0.718 ($\frac{dy}{dC} = 112.3 - 156.4C = 0$, with $0.3 \le C \le 0.9$). Since the mutation rate is not a significant factor, and a low level is commonly adopted, a mutation rate of 0.02 is used. Hence, the values of population size, crossover rate and mutation rate are 60, 0.718 and 0.02, respectively. Parameter tuning is an important step towards obtaining good results from metaheuristics. Taguchi's experimental design, big bang-big crunch, grid search and other intelligent optimization methods can be used as alternatives for parameter tuning (Zhang et al., 2018; Gao, Zhou, et al., 2019; Wang & Kumbasar, 2019).
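The choice of crossover rate can be reproduced with a few lines of arithmetic. This sketch assumes the fitted model is $y = 4177 + 1.43P + 112.3C - 78.2C^2$, where the negative sign on the $C^2$ term is inferred from the derivative $112.3 - 156.4C$ quoted in the text; the function names are illustrative.

```python
def predicted_throughput(P, C):
    """Fitted response surface (assumed form, see the note above)."""
    return 4177 + 1.43 * P + 112.3 * C - 78.2 * C ** 2


def best_crossover_rate(low=0.3, high=0.9):
    """Stationary point of the quadratic in C, clipped to the tested range."""
    stationary = 112.3 / (2 * 78.2)  # solves 112.3 - 156.4 * C = 0
    return min(max(stationary, low), high)


c_star = best_crossover_rate()
print(round(c_star, 3))
```

Because the quadratic term is negative, the stationary point is a maximum, and it falls inside the tested range of 0.3 to 0.9, so no clipping is needed. The linear term in P is positive, which is why the largest tested population size is preferred.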
Combining a genetic algorithm and discrete event simulation is time-consuming. We implement several methods to reduce computation time, as follows:
(1) Simulation is only performed for feasible solutions; a penalty is assigned directly to infeasible chromosomes without running a simulation.
(2) All code in the Flexsim simulation model is converted into C++. The default language in Flexsim 4.3 is Flexscript.
(3) 3D graphical rendering of the model is turned off, which reduces computation time dramatically.
(4) Three personal computers are used to run multiple models simultaneously.
A penalty is applied whenever a solution is infeasible. Whether a chromosome yields a feasible or an infeasible solution is determined by checking for violations of the constraints described in the previous section. For an infeasible solution, we directly assign a small number to that chromosome as its output, without running the simulation. For feasible chromosomes, the discrete-event simulator evaluates the output. The mechanism for integrating the genetic algorithm in Visual C++ with the discrete-event simulator (e.g. Flexsim) is described by the diagram and pseudo-code shown in Figure 5 and Table 3.
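The penalty rule amounts to a guard in front of the simulator call. The sketch below is an illustration under stated assumptions: `PENALTY = 0.0` is a placeholder because the text only says "a small number", and `run_simulation` stands in for the external discrete-event model, which is abstracted here as a plain Python callable.

```python
PENALTY = 0.0  # assumed placeholder; the paper does not give the value


def evaluate(chromosome, is_feasible, run_simulation, penalty=PENALTY):
    """Return the simulated output, or the penalty without simulating."""
    if not is_feasible(chromosome):
        return penalty  # the expensive simulation is skipped entirely
    return run_simulation(chromosome)


def evaluate_population(population, is_feasible, run_simulation,
                        penalty=PENALTY):
    """Evaluate every chromosome; only feasible ones reach the simulator."""
    return [evaluate(c, is_feasible, run_simulation, penalty)
            for c in population]
```

A small penalty suits a maximization objective such as throughput. For a minimization objective such as manpower, the penalty would presumably need to be a large number instead; the text does not say how this case is handled.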
The method for integrating the genetic algorithm in Visual C++ with Flexsim is as follows. A feasible solution is written to a specific file on drive C; the algorithm in Visual C++ then calls the Flexsim model of the system and waits until the Flexsim model closes by itself. When the Flexsim model is called, it imports the data from drive C and runs a simulation. At the end of the simulation, Flexsim writes the result to drive C and then closes automatically. The algorithm in Visual C++ reads the simulation result as the output of the chromosome. Repeating this exchange over the generations yields a near-optimal solution at the end. In the fitness evaluation, the chromosome with the highest output (e.g. throughput) earns the highest fitness, and the chromosome with the lowest output earns the lowest fitness. The fitness value is obtained from equation 5.
Table 3 (pseudo-code; only the final part of the table is recoverable):
    - READ data from EXCEL file and START simulation
    - When finished, SEND the result to EXCEL file (.CSV file) and exit
    - READ data from EXCEL file as fitness value of chromosome
    ELSE
    Add PENALTY as fitness value of chromosome
    WHILE there is no improvement of fitness value from the best value so far or maximum iterations is attained
To select pairs of parents for the next generation, the roulette-wheel principle is adopted by considering the probability that the cth chromosome will be chosen ($P_c$), as presented in equation 6 (Yang et al., 2007). Therefore, chromosomes with a high fitness value have a high chance of being selected as the next generation's parents.
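Roulette-wheel selection can be sketched as follows. The sketch assumes the textbook form of the selection probability, $P_c = f_c / \sum_k f_k$ with non-negative fitness values $f_c$; this may differ in detail from the paper's equation 6, and the function names are illustrative.

```python
import random


def selection_probabilities(fitness):
    """P_c = f_c / sum of all fitness values (assumed textbook form)."""
    total = sum(fitness)
    if total <= 0:
        raise ValueError("roulette-wheel selection needs a positive total fitness")
    return [f / total for f in fitness]


def roulette_select(population, fitness, rng):
    """Spin the wheel once and return the chosen chromosome."""
    r = rng.random()
    cumulative = 0.0
    for chromosome, p in zip(population, selection_probabilities(fitness)):
        cumulative += p
        if r < cumulative:
            return chromosome
    return population[-1]  # guards against floating-point rounding
```

With fitness values 1 and 3, for instance, the second chromosome occupies three quarters of the wheel and is chosen about three times as often as the first.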
We use the crossover rate and the mutation rate to determine which chromosomes undergo crossover and mutation. Table 4 illustrates the two-point crossover used in this study. The genetic algorithm stops when there is no significant improvement over the best solution found so far (gap < 0.05) or when the results converge.
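The two variation operators can be sketched as below. This is a generic illustration rather than the paper's exact procedure: cut points are chosen uniformly at random, mutation redraws a gene inside its own PM window, and start times are assumed to be integers. The names `two_point_crossover` and `mutate` are hypothetical.

```python
import random


def two_point_crossover(parent_a, parent_b, rng):
    """Swap the segment between two random cut points (needs >= 3 genes)."""
    n = len(parent_a)
    i, j = sorted(rng.sample(range(1, n), 2))  # 1 <= i < j <= n - 1
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b


def mutate(chromosome, windows, rate, rng):
    """With probability `rate` per gene, redraw it inside its PM window."""
    return [rng.randint(lower, upper) if rng.random() < rate else gene
            for gene, (lower, upper) in zip(chromosome, windows)]
```

Each child gene comes from the same position in one of the parents, so crossover never moves a start time outside its PM window, and the mutation above keeps that property as well. The other constraints can still be violated by the offspring, which is where the penalty rule applies.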

Results
This study compares three solution approaches: genetic algorithm, particle swarm optimization, and resource leveling. Similar to GA, PSO evaluates multiple solutions (particles) at a time. Particles in PSO keep track of their best performance so far as their personal best (Dong & Zhou, 2017; Gao, Cao, et al., 2019). The simulation-based GA and PSO approaches (GA-DES and PSO-DES) combine the advantages of meta-heuristics and discrete event simulation. The results of resource leveling from a previous study (Hsu, 2006) are used for comparison. We use two different kinds of problem to determine the performance of the approaches, as follows:
(1) Practical problem
This problem is based on an existing system at the chemical vapor deposition (CVD) location of a semiconductor manufacturing system. Table 5 presents the tool group data. For comparison, we use five performance indicators to assess the approaches for a 195-day (6.5 months) planning horizon with 30 replications. MANOVA (Multivariate Analysis Of Variance) is used to test the significance of the differences between approaches because there are five performance indicators. Table 6 shows an example result of MANOVA testing in Minitab 14.0, where GA-DES performs better than PSO-DES since the p-value is less than 0.05. Table 7 presents a summary of all MANOVA tests. It shows that GA-DES performs better than the other approaches in the case of minimizing manpower. Under the objective of maximizing throughput, however, GA-DES and PSO-DES are not statistically different at the p = 0.05 level, although both are better than the existing reference and resource leveling (Hsu, 2006). In terms of manpower hired, GA-DES and PSO-DES require 11.76-17.65% less than the reference.
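As background to the comparison above, the personal-best bookkeeping that PSO relies on can be sketched as follows. The paper does not report its PSO update rule or parameters, so this is the standard inertia-weight formulation with assumed values (`w`, `c1`, `c2`), continuous positions clipped to the PM windows, and a maximization objective; all names are illustrative.

```python
import random


def update_bests(positions, scores, personal_best, personal_best_score):
    """Refresh each particle's personal best; return the global best."""
    for p, score in enumerate(scores):
        if score > personal_best_score[p]:  # maximization assumed
            personal_best[p] = list(positions[p])
            personal_best_score[p] = score
    best = max(range(len(personal_best)), key=lambda p: personal_best_score[p])
    return list(personal_best[best]), personal_best_score[best]


def pso_step(positions, velocities, personal_best, global_best, windows, rng,
             w=0.7, c1=1.5, c2=1.5):
    """One in-place velocity and position update for every particle."""
    for p in range(len(positions)):
        for g, (lower, upper) in enumerate(windows):
            r1, r2 = rng.random(), rng.random()
            velocities[p][g] = (
                w * velocities[p][g]
                + c1 * r1 * (personal_best[p][g] - positions[p][g])
                + c2 * r2 * (global_best[g] - positions[p][g])
            )
            moved = positions[p][g] + velocities[p][g]
            # Clip so the start time stays inside its PM window.
            positions[p][g] = min(max(moved, lower), upper)
```

In a simulation-based setting, the `scores` passed to `update_bests` would come from the discrete-event model, in the same way as the fitness of a GA chromosome.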
(2) Additional problems or scenarios
Table 8 shows five additional scenarios based on artificial perturbations of the real data in the previous section.
Using the best PM start times found by each approach for the different scenarios, 30 replications of the simulation are performed, as summarized in Table 9. The maximum number of manpower is associated with the hiring level. Most of the results show that the hiring levels of GA-DES and PSO-DES are similar. Moreover, the hiring level of GA-DES and PSO-DES can be up to 15.38% lower than that of resource leveling, except for scenario 2.
Because there are multiple outputs, we then use MANOVA to evaluate the performance of the three problem-solving approaches. GA-DES and PSO-DES perform better than resource leveling for all scenarios. However, there is not enough statistical evidence at the 0.05 level to conclude that GA-DES performs differently from PSO-DES. In general, the outputs obtained with the PM start times from GA-DES are statistically equal to those obtained with the PM start times from PSO-DES.

Conclusions
Preventive maintenance scheduling in the semiconductor industry is a complex problem in which many independent variables are simultaneously involved. Many studies in the literature address the preventive maintenance scheduling problem (PMSP). However, most existing studies use simple cases of PMSP in terms of the number of tools involved, the number of PM tasks, the assumptions about PM duration, and the scheduling horizon, as well as in modelling a real system. In this study, we use a real-world problem in a wafer fab. The re-entrant process, which is a common phenomenon in wafer fabs, makes the problem more challenging to solve. We propose a framework for preventive maintenance scheduling, mainly to find the best start time of each PM within its PM window in order to minimize production losses due to maintenance activities. We use discrete event simulation to mimic a real complex system in the wafer fab and embed it in metaheuristic algorithms (e.g. GA, PSO) to obtain near-optimum results. Based on the PM start times produced by the proposed approach, we collect system outputs such as throughput, WIP, cycle time, equipment utilization and the level of manpower.
The results show that combining metaheuristics and discrete event simulation performs better than resource leveling and the baseline. Although computation time is still too long, this study provides a promising approach for solving real complex systems by utilizing metaheuristics and discrete event simulation for optimization.

Future research
We acknowledge some limitations of this paper and suggest a few ways to extend this study. Embedding discrete event simulation in a metaheuristic algorithm for solving a real complex system is time-consuming. Therefore, designing a more efficient approximation or analytical method would represent an important research avenue. The use of technology for intelligent maintenance through data-driven fault prognosis will become an important research direction (Zhong et al., 2020). In this study, we use First In First Out (FIFO) as the dispatching rule. Further study could therefore treat the dispatching rule as an independent variable. Including maintenance cost to determine PM efficiency and considering a safety stock policy to cover maintenance activities are also potential research extensions.

Disclosure statement
No potential conflict of interest was reported by the author(s).