A multi-fidelity modelling approach for airline disruption management using simulation

Abstract Disruption is a serious and common problem for the airline industry. High utilisation of aircraft and airport resources mean that disruptive events can have large knock-on effects for the rest of the schedule. The airline must rearrange their schedule to reduce the impact. The focus in this paper is on the Aircraft Recovery Problem. The complexity and uncertainty involved in the industry makes this a difficult problem to solve. Many deterministic modelling approaches have been proposed, but these struggle to handle the inherent variability in the problem. This paper proposes a multi-fidelity modelling framework, enabling uncertain elements of the environment to be included within the decision making process. We combine a deterministic integer program to find initial solutions and a novel simulation optimisation procedure to improve these solutions. This allows the solutions to be evaluated whilst accounting for the uncertainty of the problem. The empirical evaluation suggests that the combination consistently finds good rescheduling options.


Introduction
Disruption to airline schedules creates major issues within the aviation industry. During planning stages, airlines use optimisation algorithms to produce schedules maximising profit with high utilisation of their resources. Developments in this field are enabling schedules with higher utilisation of resources whilst including some slack to recover from minor disruptions. However, due to complexity and high levels of uncertainty in the industry, it is rare that flight programmes operate as initially planned. The Federal Aviation Administration estimated that the total cost of delays in the U.S.A. were $33 billion in 2019 (Federal Aviation Administration, 2020). In the same year, only 77.6% of flights in Europe arrived within 15 min of the schedule, with 4.3% of flights experiencing departure delays exceeding 1 h (EUROCONTROL, 2020). Disruptive events have various roots which Bratu and Barnhart (2006) classify as either shortages in airline resources (such as aircraft requiring unscheduled maintenance) or shortages in airport resources (such as reductions in aircraft movements due to weather conditions). In addition, an airline's policy of waiting for connecting passengers can also cause delays. The impacts of such events can propagate through the system causing further delays and cancellations, particularly when the airline has high aircraft utilisation and airport resources operate at or near full capacity.
When a disruption occurs, the Operations Control Centre (OCC) of the airline must propose alterations to its schedule to Air Traffic Control (ATC) for their approval, bearing in mind the repercussions on finances, reputation and passengers. ATC has the right to reject proposals, in which case OCC needs to be able to suggest "good but different" solutions. The OCC wants to identify good actions and evaluate them from the perspective of the airline and the passengers. Kohl et al. (2007) describe the practice of the OCC during disrupted operations, stating that the problem is often split into three: aircraft, crew and passengers. This paper focusses on the Aircraft Recovery Problem (ARP). The available options for the ARP include delaying or cancelling flights, or exchanging aircraft.
The financial costs of a disruption can be significant. However, there are other concerns for the airline too. The airline will not want to incur high levels of delay, as this can lead to discontented passengers and affect reputation. Neither will it want to make too many changes to its schedule. Rerouting aircraft can interfere with crew and future fleet maintenance schedules and gate assignments (Thengvall et al., 2000). As the ARP solution must be used in conjunction with the crew recovery solution, minimising schedule alterations helps the integration. Whilst both considerations do have financial consequences, these can be difficult to quantify, making this a multi-objective problem.
The complexity of the industry can make it difficult to determine the consequences of schedule alterations. Operational research techniques can be of use here, helping to search for good solutions (Kohl et al., 2007). Some of the complexity can be represented using deterministic models, a number of which have been proposed (Clausen et al., 2010). However, these models cannot fully account for the various uncertain elements of the environment, such as runway queueing times, which can affect the performance and viability of solutions. This raises the question of how the inherent variability of airline operations can be included within the search for the best recovery action, which is the focus of this paper.
A paradigm for modelling the uncertainty is stochastic simulation. In a simulation model, both complexity and uncertainty are explicitly represented. However, simulation does not provide a natural strategy for searching through the possible revised schedules. This search must cope with combinatorial constraints, a difficulty for simulation optimisation. One potential approach is to consider multi-fidelity modelling, combining multiple modelling paradigms of varying accuracy, each used to gain an insight into the problem.
This paper presents a multi-fidelity modelling framework for the ARP, which is designed to address key features of the real problem. Overall it recognises and takes advantage of the multi-objective nature of the problem to generate a range of solutions. Each of these solutions is found by combining the abilities of: 1. Integer Programming (IP) to efficiently find good solutions to a highly constrained deterministic problem, allowing the discrete reassignment aircraft allocation problem to be solved within appropriate time scales; 2. simulation to evaluate the performance of complex stochastic systems; and 3. a simulation optimisation algorithm designed to improve the solutions from the IP.
In our context, we take the 'low-fidelity' model to mean the IP, as it ignores uncertainty and works in discrete time. The simulation will be referred to as a 'high-fidelity model', as it aims to represent much more of the real system than the IP does.
The primary contribution of this paper is this framework, allowing stochastic information from the simulation to be used within the optimisation. To demonstrate the feasibility and potential benefits of the multi-fidelity modelling framework, the proposed approach is applied to three realistic (but not real) test cases which are designed to reflect the main features of this problem, including realistic flight schedules and disruptions, as well as the multi-criteria nature of the decision making process. Data used in these test cases are given in the Supplementary Material, including sources where available in case other researchers wish to apply other modelling frameworks. Early proof-of-concept work was presented in Rhodes-Leader et al. (2018).
The remainder of this paper is structured as follows. Section 2 discusses related work in airline disruption management and multi-fidelity modelling. The formal problem statement and method overview are in Section 3. Section 4 describes the low-fidelity IP model and how it is used to find "good but different" solutions to the deterministic problem. Section 5 describes the high-fidelity simulation model, and how simulation optimisation is used to improve upon the IP generated solutions. This includes how we extended the STRONG algorithm for simulation optimisation to deal with the constraints of the ARP. Computational experiments and results are discussed in Section 6, followed by conclusions in Section 7.

Related work
Airline disruption management has received much attention within the literature, with many deterministic models and solution methods proposed. Clausen et al. (2010) review the area. Rosenberger et al. (2003) use a set-packing IP formulation along with a problem reduction heuristic for the ARP. Recently, advances in solvers have allowed more sophisticated exact solution methods, such as column and row generation. These have been applied in both the ARP (Liang et al., 2018) and the combined aircraft and crew problem (Maher, 2016). Heuristic approaches include Steepest Ascent Local Search (Løve et al., 2005), Genetic Algorithms (Jeng, 2012) and Large Neighbourhood Searches (Sinclair et al., 2014).
One class of models reported in the ARP literature is time-space networks. These consider schedules as networks where flights are arcs between nodes representing airports at a given time. Examples of the use of these models include Thengvall et al. (2000) for the ARP, Bratu and Barnhart (2006) (aircraft and passenger) and Zhang et al. (2015) (aircraft and crew).
As with many real problems, the ARP has a number of performance measures of interest including costs, passenger delays and resuming normal operations quickly (Clausen et al., 2010). Thengvall et al. (2000) note the trade-off between maximum revenue and minimal schedule alterations, though not explicitly including these in the model. Hu et al. (2017) propose a full multi-objective formulation, trying to identify Pareto optimal solutions, with objectives considering disruption cost, number of aircraft rerouted and maximum delay. The authors use a local neighbourhood search heuristic combined with an -constraint method. Rosenberger et al. (2003) conclude that a robust solution accounting for uncertainty may perform better than deterministically optimal solutions. Zhu et al. (2015) incorporate some uncertainty, assuming the duration of unplanned maintenance is unknown. They propose a two-stage recourse stochastic program, the first allocating aircraft to rotations, then retiming flights. The scenarios are the ready times of aircraft requiring unplanned maintenance. However, including scenarios with other uncertain elements would greatly increase the problem size, making it computationally intractable when decisions must be made quickly. Rosenberger et al. (2000) develop a stochastic discrete event simulation model of airline operations known as SimAir to simulate the schedule over long periods (weeks or months). This is extended by Rosenberger et al. (2002) and Lee et al. (2003). The purpose of SimAir lies in the schedule planning stage, providing a method to evaluate recovery policies on a long term schedule and monitor its robustness. There is also scope to use simulation at the operational level. A simulation to perform "what if" analysis could be valuable to the OCC (Kohl et al., 2007). Abdelghany et al. (2008) propose a network simulation to predict which flights will be disrupted by considering future aircraft and crew connections. Only these flights are included in an IP model for the aircraft and crew recovery problem. Hutchison and Hill (2001) propose a Simultaneous Perturbation Stochastic Approximation simulation optimisation process to change the delays on flights of Ground Delay Programmes under poor weather conditions. This did not involve reallocating aircraft.
Stochastic turn times and flight durations within the ARP are considered by Arias et al. (2013), combining a deterministic constraint programming approach with a Monte Carlo simulation to try to minimise delay. The simulation evaluates the constraint program solution under several scenarios by adding normal noise to the turn times and flight durations. If the solution is not robust across these scenarios, it is rejected and the process begins again. Guimarans et al. (2015) expand this approach by combining the constraint program with a Large Neighbourhood Search to propose new solutions and use the simulation at each proposed solution to evaluate the acceptance criteria. However, the simulation remains rudimentary, and using a higher fidelity model would harm the performance of the algorithm if the simulation is used at every iteration. Of the current literature, our proposal most closely resembles the approaches of Arias et al. (2013) and Guimarans et al. (2015), with the aim to incorporate a higher-fidelity simulation model. Wang et al. (2019) develop a discrete event dynamic system simulation model for the ARP. Rather than using optimisation, which flights to delay or cancel is based on a set of rule-based methods. Reallocating aircraft is not considered. The simulation is used to identify the best proposed solution.
Multi-fidelity modelling could improve the efficiency of simulation optimisation by reducing computation times. The simpler, or 'low-fidelity', models can quickly identify potentially good solutions to be evaluated using the highest fidelity model for an improved understanding of their performance. These are prevalent in simheuristics (Juan et al., 2015), which often use deterministic models and heuristic searches to identify solutions to test in the simulation. Examples include combining a linear program with a simulation for production planning in a manufacturing context (Bang & Kim, 2010), a deterministic model with simulation for the inventory routing problem with stochastic demands (Onggo et al., 2019) and infinite-server queueing models and simulation for staffing A&E departments (Izady & Worthington, 2012). Xu et al. (2016) argue that low-fidelity models may have inconsistent inaccuracies, changing the optimal solution. This motivates utilising the simulation information within the search process itself. For example, one could model the high-fidelity performance as the low-fidelity performance plus an error function, which could be a quadratic response surface (Osorio & Chong, 2015) or a Gaussian Process (Huang et al., 2006;Inanlouganji et al., 2018). This allows both models to be used within the search. Our algorithm is more similar to that of Jian and Henderson (2015) who use simulation optimisation to improve upon the solutions from low-fidelity fluid and Markov chain models for bike allocation within a bike-sharing system.

Problem statement and method overview
Suppose an incident has occurred that disrupts an airline's intended schedule. The airline wishes to take action to minimise the disruption's impact, returning to normal operations within a certain time period known as the recovery window. The recovery window could extend to the end of the operating day (for short-haul carriers), as in Løve et al. (2005) and Zhang et al. (2015). Maher (2016) and Hu et al. (2017) use shorter recovery windows, whilst Sinclair et al. (2014) works with periods of multiple days. Note that when the recovery window does not coincide with a break in operations, such as the day's end, one must consider the desired state of the fleet at the end of the recovery window. Maher (2016) points out that longer recovery windows can make the problems more difficult to solve.
Let A be the set of aircraft in play for resolving the disruption, wherever they are located. Here we assume all aircraft are of the same type. Let F be the set of flights originally covered by A within the recovery window, with n ¼ jFj being the number of flights. Each flight, f 2 F, will require an aircraft, a 2 A, and a planned delay time, d f , or a cancellation which must be submitted to Air Traffic Control (ATC) before the departure time. Let x ¼ fx f a 2 f0, 1g : f 2 F, a 2 Ag denote the aircraft allocation (where x f a 2 f0, 1g and x f a ¼ 1 if and only if aircraft a is assigned to flight f) and d ¼ ðd f : f 2 FÞ be the vector of planned delays. The OCC has the multiobjective aim of rescheduling to minimise alterations to the original schedule, delays and costs by setting the decision variables x and d. The costs arise directly from delays, passenger compensation and cancellation charges. Furthermore, this plan should be achievable, avoiding over-promising which could lead to flight f being delayed by more than d f , damaging the airline's reputation and resulting in operational costs as a new plan with further schedule adjustments must be submitted to ATC. These additional costs are accounted for by a penalty if the actual delay exceeds d f . This paper proposes a multi-fidelity modelling approach to the ARP with two stages. Stage one uses a low-fidelity deterministic time-space network IP with discrete time. This IP identifies an aircraft allocation to flights and gives an initial value for the planned delay of each flight, considering all objectives. Solving the IP in a multi-objective manner produces a set of rescheduling options with different priorities for these objectives. This allows the combinatorial aircraft allocation aspect of the problem to be solved in a deterministic manner, separating it from the simulation optimisation.
The second stage uses each solution from the IP as a starting point for a local search for improvement in expected cost using simulation optimisation. This search enables more detailed information from the simulation to direct the optimisation process. We fix the aircraft allocation for each solution, treating only the planned delays as decision variables, removing the combinatorial constraints. The delays are treated as continuous variables, allowing the use of continuous simulation optimisation methods and gradient information.

Low-fidelity model
In this section we describe our low-fidelity model.

Time-space network model
The low-fidelity model is based on a time-space network IP (e.g. Thengvall et al., 2000). The flight schedule covering the recovery window is represented by a sequence of flight arcs between airports and ground arcs connecting a landing to a subsequent take-off. The departure and arrival of each flight is a transition node, 2 V T , representing an airport at a particular time. Under disruption, the network is augmented with additional flight arcs for each flight starting from a later time. Let f d be the flight arc representing the flight f delayed by d minutes. The potential delays are discrete time steps of size m up to a maximum M, i.e. d 2 f0, m, 2m, :::, Mg: This introduces new nodes and ground arcs into the network. Let L denote the set of all flight arcs, while the set of flight arcs associated with flight f is L f . G is the set of all ground arcs in the network.
In addition to the transition nodes, each aircraft a 2 A has an input node, iðaÞ 2 V I , representing its location at the beginning of the recovery window. This applies to all aircraft in the fleet, regardless of the airport at which they are currently located. The input node is connected to the first node in V T at which a is available for flight by a ground arc. No other arcs leave i(a). All aircraft will end the recovery window at a return node, r 2 V R : Some aircraft may have a specified return node, denoted r(a). For example, an aircraft may be required at an airport for its scheduled maintenance. In this case, a constraint is imposed to ensure the new schedule respects the maintenance plan. Let A R denote the set of aircraft required to be at specific return nodes. The complete set of all nodes is Figure 1 shows an example network for two aircraft, A1 and A2. The original schedule is represented by the original flight arcs and the ground arcs. At the end of the day, A1 is required at BHX for maintenance, so the return node for BHX is labelled r(A1) and A R ¼ fA1g: Suppose A2 will not be ready to fly until 7:00, hence iðA2Þ is connected to node (BHX,7) rather than (BHX,6). If M ¼ 60 min and m ¼ 30 min, two additional flight arcs are added for each flight representing a delay of 30 min (dashed arcs) and 60 min (dotted arcs). The recovery window ends at 18:00, so no additional arcs are added that arrive beyond this time.
To ensure continuous aircraft movement through the network, flow constraints are imposed. Let L in and G in be the sets of flight arcs and ground arcs incident on node 2 V, and L out and G out be the sets of flight arcs and ground arcs exiting 2 V: Between flights, aircraft must be prepared for the next flight. The industry term for this is the 'turn time'. The schedule must leave a minimum turn time between flights, t min : For node representing an airport at time t , let V t min be the set of all nodes, 0 2 V T , representing the same airport at time t 0 less than t min later than time t , i.e. t 0 2 ½t , t þ t min Þ: If an aircraft ends a flight at node , it cannot start a new flight from any node in V t min : The choice of t min affects the recovery solution. Since turn time is actually variable, larger values of t min correspond to higher quantiles of the turn time distribution, creating more robust solutions. However, if t min is too large, it may delay flights that are not disrupted (if there is little buffer between flights).
A recovery plan must also ensure that the schedule beyond the recovery window is feasible with little or no disruption. For this to occur, the airline must have sufficient aircraft at each airport to operate the schedule after the recovery window. This is known as aircraft balance. Let r A be the number of aircraft required by return node r. This constraint is used so that normal operations, i.e. no delays or cancellations, are resumed by the end of the recovery window. However, the aircraft allocation may differ from the original schedule.
An airline may only be allowed to use an airport runway during particular time slots due to the airport's runway capacity constraints. A "slot" s is a time period at a particular airport, W s , with a restriction on the number of aircraft movements. The set S lists all slots. Let K s be the maximum number of aircraft movements allowed by the airline in slot s at the relevant airport W s and L s be all flight arcs impacting slot s. This mechanism also deals with curfew times at airports, for example, preventing aircraft taking off during the night.

Integer program formulation
The model formulation is adapted from the aircraft recovery model in Zhang et al. (2015), but also draws on the models of Thengvall et al. (2000) and Jeng (2012), modified to allocate specific aircraft to flights, rather than fleet assignment.
We assume the cost of delaying flight f by d is given by , where c d is the cost of delay per minute and P f is the appropriate passenger compensation cost function, with the cost per passenger often being a step function of the delay. For example, for short-haul flights arriving or departing Europe, Civil Aviation Authority (2015) state that for a delay of 2 h, the airline must cover food and drink of each passenger, whilst a delay of over 3 h leads to a compensation of e250 per passenger. However, more general delay costs could be accommodated within this framework. Let C f be the cost of cancelling flight f. Let o f a 2 f0, 1g indicate whether aircraft a was assigned to flight f before the disruption occurred.
The decision variables for the IP are all binary variables. Let x f d a indicate that aircraft a is assigned to flight arc f d , y f indicate that flight f is cancelled, and z c a indicate that aircraft a uses ground arc c. The complete formulation is as follows.
Objective (1a) relates to the cost of the recovery action, objective (1b) to the number of changes made to aircraft allocation and objective (1c) to the total planned delay. Constraint (1d) ensures that each flight is either flown once or cancelled. Constraint (1e) is the slot constraint. Constraint (1f) prevents a turn time of less than the minimum allowable turn time; if aircraft a uses a flight arc incident on node , it cannot then use a flight arc exiting or any other node 0 within t min of . Constraint (1g) is a flow constraint for each 2 V T , whilst constraints (1h) and (1i) are analogous for the input and return nodes (where applicable), respectively. Constraint (1j) ensures aircraft balance at the end of the recovery window.

Solving the integer program
To generate multiple Pareto optimal solutions, the IP is solved using the -constraint method (Haimes et al., 1971). The objectives (1b) and (1c) are added to the problem as constraints: X The program is solved several times with different constraint limits, l D , u D , l E and u E , minimising objective (1a). These parameters are modified efficiently using the method proposed by Laumanns et al. (2006) to explore various regions of the Pareto frontier. For further information about this method, see Section 1 of the accompanying supplementary material. A time limit is imposed for this application. Each iteration, defined by a set of constraint limits, is solved to optimality, except the final iteration which may run out of time.
The result is a set of solutions, X , each with an aircraft allocation (including which flights are cancelled), x, and a delay value for all flights in the programme, d. The time limit means Pareto optimal solutions cannot be guaranteed. These solutions become starting points for the simulation optimisation, seeking local improvement.

High-fidelity model
The input recovery schedule for the simulation is given by an aircraft allocation x and a set of planned delays, d. The elements of x and d link to the IP variables as follows: whilst a cancellation of flight f can be inferred when P a2A x f a ¼ 0: Let D f be the actual delay of flight f. This is a random variable with a distribution dependent on ðx, dÞ: The objective function is the disruption's expected cost: Here, c p is the penalty per minute for the actual delay D f exceeding the planned delay d f , incurred in addition to the delay cost c d D f : P f ðD f Þ represents the compensation associated with passenger delays of flight f and CðxÞ is the cost of cancellations in the allocation x. This objective function corresponds to (1a) in the IP. However, the framework could be used to handle any cost function used by an airline.

Simulation model
The simulation model is built within AnyLogic 8.2.3 (The AnyLogic Company, 2017). It simulates a homogeneous sub-fleet of a short-haul airline operating the recovery action ðx, dÞ over a set of airports during the defined recovery window. Each aircraft follows its assigned schedule, subject to stochastic flight durations, turn times, queueing times and maintenance. The general framework is similar to SimAir, as discussed in Lee et al. (2003).
Where possible, distribution parameters used Maximum Likelihood Estimators based on the available data (Flightradar24 AB, 2017), obtained using the pyflightdata Python package (Allamraju, 2017). The data set contains information on 2.8 million flights, starting or ending at a set of 163 European airports. The information consists of origin, destination, aircraft type and identification, airline, departure and arrival times, and approximately half of the entries also have scheduled departure and arrival times. Where the data source does not track this information, assumptions on distributions and parameters were made. For a more in depth reporting of the simulation model used here, see Sections 2 and 3 of the supplementary material.
Whilst the simulation does not model all the details of airline operations, it does contain the key features necessary to evaluate our multi-fidelity approach. An airline has access to their own historical data for flight durations and turn times at all airports of interest, and a detailed knowledge of aircraft time-to-failure distributions and the times to repair aircraft, based on their facilities across the network. In addition, they will know the operating procedures at the individual airports at which they operate, accurate weather forecasts, and use of general data sources such as Flightradar24 (2017) to estimate how other airline's operations may impact on runway operations. All of this information will allow an airline to improve upon the simulation model used here, making it more specialised and relevant to their circumstances.

Simulation optimisation
Solutions found using the low-fidelity model, ðx, d 0 Þ 2 X, can be evaluated with more fidelity using the simulation model. Furthermore, to account for uncertainty not present in the low-fidelity model, an improvement is sought. Fixing x for each solution removes the combinatorial aspects, leaving a continuous optimisation problem over the planned delays, d. To reduce the problem further, only the n þ flights to which the IP allocates nonzero delays, f 2 F þ ¼ ff 2 F : d f 0 >0g, are varied in the optimisation. The simulation optimisation problem is therefore where g is defined in equation (2), and the constraint (3 b) states that a flight cannot leave early and must not exceed a maximum delay u f . Thus, the feasible region, D, is a hyper-box in R n þ : We created an extended version of the STRONG simulation optimisation algorithm proposed by Chang et al. (2013) to enable it to cope with constraints. The main components of the extension are described below.
STRONG combines trust-region optimisation (see Conn et al., 2000) with Response Surface Modelling (see Kleijnen, 2014). At the j th iteration, a linear or quadratic meta-model,q j ðdÞ, is built to approximate the objective function around the current solution, d j : A new solution, d Ã j , is proposed using a sub-problem of (approximately) minimisingq j ðdÞ over an area around d j , B j , known as the trust region and determined by the trust-region radius, D j . The proposed step to d Ã j is accepted or rejected based on two tests. The first compares the observed decrease in g with the predicted decrease inq j , whilst the second is a hypothesis test to see if gðx, d Ã j Þ is significantly lower than gðx, d j Þ in the statistical sense. The trust region shrinks or expands depending on whether the metamodel produces a good solution. The primary difficulty for simulation optimisation is that gðx, dÞ and q j ðdÞ can only be estimated by taking multiple replications of the simulation at the solution ðx, dÞ: As STRONG is for unconstrained optimisation, some significant adaptations are required as the optimum may lie on the boundary of D: We briefly describe these in the remainder of this section. For further details on STRONG and the implementation of these adaptations, see Chang et al. (2013) and Section 4 of the Supplementary Materials, respectively.

Building the meta-model
When near a boundary, the classical factorial designs proposed by Chang et al. (2013) for fittinĝ q j ðdÞ are not feasible as some design points may lie outside D: A good design matrix, D j , is required to estimateq j ðdÞ: One criterion for designs is D-optimality, which seeks to minimise the generalised variance of the design, det½ðD T j D j Þ À1 (see Montgomery, 2009). Finding such a design is a complex optimisation problem. We implement the Coordinate-Exchange Algorithm heuristic proposed by Meyer and Nachtsheim (1995) to build the design in each iteration. This reduces the task to a sequence of one-dimensional problems by iteratively moving the design points around the trust region in a coordinate-wise manner to increase det½D T j D j : Whilst not guaranteeing optimality, it produces good designs quickly, an important feature as it is run at each iteration in STRONG.

Proposing a new solution
After the meta-model is estimated, the sub-problem needs to be solved. Constraints on the feasible region mean that a solution proposed by minimisinĝ q j ðdÞ in the negative gradient direction within the trust region (as in STRONG) may not be feasible. Thus we adapt a procedure to guarantee a feasible solution. For bound constraints, such as (3b), Conn et al. (1988) suggested using a hyper-box trust region, defined by the ' 1 norm, as this aligns itself with the boundaries of D: Thus, our sub-problem becomes: The objective is to find a step s that minimises the meta-model,q j ðdÞ, whilst retaining feasibility (constraint (4b)) and remaining within the trust region (constraint (4c)). Rather than solving this constrained problem exactly, Conn et al. (1988) propose finding the Generalised Cauchy Step, s Ã j , the step that minimisesq j ðdÞ along the negative gradient direction projected onto the feasible trust region. This path is known as the projected gradient path, and consists of a series of straight lines. This simplified problem can be solved exactly. For the linear model, the step is to a corner of the hyper-box D \ B j : For the quadratic models, we use Algorithm 17.3.1 of Conn et al. (2000, p. 791), which finds local minima in each line segment of the projected gradient path, and selects the smallest. The proposed solution, d Ã j ¼ d j þ s Ã j , is then simulated to estimate gðx, d Ã j Þ:

Final stages
The algorithm ends when a replication budget is exhausted or a tolerance on optimality has been met. Due to discretisation of the IP, some unnecessary delays may have been inserted. Therefore, the algorithm tests the final solution ðx, d Ã Þ against ðx, 0Þ, to check that these delays are necessary. A one-sided Welch's t-test (Welch, 1938) is used to compare: If H null is rejected, the final solution returned is ðx, 0Þ: The whole procedure is performed for each ðx, d 0 Þ 2 X, producing a set of improved solutions that have been thoroughly evaluated using the simulation. The airline can then consider each solution alongside other practicalities, before a solution is proposed to ATC.

Empirical evaluation
In practice, the IP would be solved as in Section 4.3 to produce a set of solutions and the simulation optimisation procedure of Section 5.2 would then be used once on each solution. However, when performing a computational evaluation of simulation optimisation algorithms on real-world problems, the reporting must consider two levels of uncertainty. The first is the uncertainty in the performance of an individual solution. As the exact performance is unknown, performance must be estimated using many replications of the simulation. As the rescheduling is a one-time-only decision, understanding the full distribution of outcomes is important, rather than just the expected value. Thus, the solution returned by the simulation optimisation algorithm is simulated many times for a thorough understanding of its performance. Using Common Random Numbers (CRN) improves the comparison by ensuring, as much as possible, that the solutions face the same situations.
The second layer is in the algorithm itself. The algorithm described in Section 5 relies on the random output of simulation replications. Therefore, the solution returned is dependent on the starting seed of the random number stream; a single algorithm run cannot characterise the algorithm's performance. Thus, for the purposes of evaluating this method, for each problem, we perform macro-replications of the whole algorithm (excluding the IP, which is deterministic). Each macro-replication produces a possibly unique solution, which is simulated using 1000 CRN replications.
Another difficulty is that the optimal expected cost is unknown. Thus, one cannot simply measure the optimality gap. Instead, we either evaluate the simulation optimisation performance in terms of the improvement over the IP solutions or the gap to the best solution found across the macro-replications.

Problem descriptions
Three realistic problems were created based on short-haul carrier schedules extracted from the data source (Flightradar24 AB, 2017). Problem 1 involves a fleet of 8 aircraft, one of which has a technical fault requiring a few hours to repair. Problem 2 has a similar setting, but with a fleet of 48 aircraft. In both cases, the proposed method is used to reschedule the remainder of the day's flights.
In Problem 3 poor weather at a hub airport causes high congestion at the runways, affecting 11 immediate flights and their subsequent schedules. The fleet size is 102 aircraft. Here, the airline cannot make changes to flights from the affected airport, which are excluded from the IP, but can rearrange future flights. The original schedule under the disruption is simulated to estimate when the affected aircraft become available after completing their weather disrupted flights. These times are used for the respective input node times within the IP.
The cost per minute of delay is set to c d ¼ e50 with a penalty cost of c p ¼ e20 per minute. Further details of the problems, algorithm parameter settings and results can be found in the supplementary material.

Integer program results
For each problem, the IP was solved using the Gurobi Optimizer 7.0.2 (Gurobi Optimization LLC, 2017) on a Cluster of 4 nodes each with 2 x Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.6 GHz and 512 Gb RAM, running Ubuntu Linux. For Problem 1, 8 processors were used, and 20 processors for Problems 2 and 3. The results are shown in Table 1.
Each of the problems show a clear trade-off between cost and the number of aircraft exchanges. This is predictable, since allowing more aircraft to be exchanged gives more flexibility to make schedule changes, and thus lower cost solutions become available. In each problem, the algorithm prevents cancellations and largely avoids long delays. Therefore, compensation costs are generally avoided, as these only apply to cancelled flights and delays exceeding 2 h. Thus, cost and delay are highly correlated.
Many of the solutions are achieved within two minutes. This speed would be appropriate for the application. However, for the larger problems, some of the solution times are too long, especially as the maximum allowable delay for Problem 3 is only M ¼ 60 min. Whilst the time available to solve a problem depends on the nature of the disruption, the time limits used in the literature vary from 3 min (e.g. Løve et al., 2005) to 20 min for a multiobjective approach (Hu et al., 2017). As the IP is only the first stage of our proposed framework, this suggests more specialised problem reduction and solution approaches are required.
One important observation is the time to find the first Pareto optimal solution in the larger problems: it took 8 and 12 iterations of the multi-objective algorithm for Problems 2 and 3, respectively. This suggests that the pure -constraint method could be improved upon, such as using a hybrid multi-objective to penalise aircraft exchanges that are not Pareto optimal.
Additional experimentation (not included in Table 1) indicated a further trade-off in the choice of m. A smaller value of m allows a greater choice of flight arcs; thus a more flexible solution space. This can improve solutions significantly as new aircraft exchanges become available. However, the IP grows quickly as m decreases (the number of variables is Oð1=mÞ). Moreover, Thengvall et al. (2000) found it increased the number of non-integer solutions to the linear relaxation, increasing the solution time. This is particularly clear in Problem 3. Using m ¼ 10 min instead gives an optimal cost of e15,000, but the first iteration takes 864.70 s to solve. The improvement in solution quality suggests that developing problem reduction techniques that enable a smaller step size would be advantageous for real implementation.

Simulation optimisation results
Each of the solutions in Table 1 provide a starting point for the simulation optimisation algorithm, with a budget of 2000 simulation replications, and 50 replications for the comparison with d ¼ 0: (An iteration in progress when the budget runs out is allowed to complete.) The maximum number of samples at each design point to build the metamodelq j ðdÞ is 50. Up to 100 replications are used for the acceptance tests. The full set of algorithm parameters used is listed in Section 5 of the supplementary material.
Each experiment is repeated multiple times (100 for Problem 1, and 50 for Problems 2 and 3). This produces either 100 or 50 (possibly unique) solutions for each starting point. To evaluate these solutions, each is simulated 1000 times using CRN to estimate its cost distribution.
For comparison, each starting point and a "No Action" response are also simulated using the same CRN replications. This solution involves no changes to the schedule; each disrupted aircraft operates its original schedule once it is available to do so, generally leading to large unplanned delays. Whilst unrealistic, the "No Action" response of making no 447.72 Note: Delays are measured in minutes, and the solution time is measured in seconds. a Cumulative time of multiple iterations of the multi-objective algorithm producing cost optimal solutions but with more aircraft exchanges than the Pareto optimal solutions. changes to the schedule gives a benchmark for potential improvement. Figures 2, 3, and 4 show the Empirical Cumulative Distribution Functions (ECDFs) of the simulated cost of the IP solution, the "No Action" solution and the simulation optimisation solutions (combined into one distribution). Combining the simulation optimisation solutions into one ECDF helps to account for the variance due to the starting seed, though it can hide poor performing solutions.
The ideal outcome from these plots would be the simulation optimisation ECDF never going below the IP solution ECDF. In general, whilst the mean decreases in all scenarios, the variance increases as the simulation optimisation capitalises on the scheduled buffer to reduce delays where possible. The worst case is seen in Problem 1 (Figure 2) where the upper tail of the distributions get longer. The majority of the macro-replication solutions see an increase in the 95 th percentile. However, the upper quantiles are similar to the IP solutions, giving a sense of robustness.
For Problem 2, see Figure 3, the improvement is smaller, as much of the gains have been found using the IP. For each of the starting solutions the 95 th percentile is reduced in at least 78% of macroreplications.
In the third problem, Figure 4, the gain made by the IP is fairly small. These poor initial results could be due to conservatively estimated ready times of the aircraft directly affected by the poor weather.
However, the change in aircraft allocation allows the simulation optimisation to reduce the costs. The combination leads to a much improved result. In all macro-replications, the simulation reduces both the mean and the 95 th percentile cost.
In all cases the IP and simulation optimisation solutions are more robust than the "No Action" option.
The final analysis considers the consistency of the simulation optimisation. As the aim is to minimise the expected cost, we consider the gap between the best mean cost from the macro-replications. The estimated "relative optimality gap" of the solution of the k th macro-replication of Plan p, ðx p , d kp Þ, is q gap ðx p , d kp Þ ¼ĝ ðx p , d kp ÞÀmin k fĝ ðx p , d kp Þg min k fĝ ðx p , d kp Þg Â 100%: Note that as the true optimal is unknown, this is an estimate of the optimality gap relative to the best solution discovered. Figure 5 shows the ECDFs of q gap ðx p , d kp Þ for each problem (all starting points combined into one plot). It shows that the algorithm is least consistent for Problem 1, where it is only within 5% of the best solution in 60% of macro-replications. The consistency is much higher for Problems 2 and 3, where in 90% of macro-replications it achieves within 5% of the best solution found.

Conclusions and further work
This paper has proposed a multi-fidelity modelling approach for the ARP. It combines integer programming with simulation to effectively search through the solution space. Using the simulation as part of the optimisation allows some correction for simplifications in the IP. The nature of the real problem is that OCC needs to be able to integrate the ARP solution with other parts of the recovery plan and provide it to ATC, who can reject a proposed plan. This increases the value of producing a range of good but different solutions within a limited time. The results show that this combination is  consistently able to find good solutions to the ARP, with the simulation optimisation providing improvements over the initial solutions. We believe that the combination of IP to find Pareto-optimal aircraft allocations to a deterministic version of the problem, followed by a local-search based simulation optimisation to fine tune the delays is well-suited to the ARP context.
As part of the work it was necessary to adapt the simulation optimisation algorithm STRONG to cope with bound constraints, and as such could be applicable beyond the ARP studied here.
Whilst we have demonstrated our framework across three realistic problems, further investigation of the framework across a larger variety of problem instances would improve our understanding of the performance and potentially identify if there are particular problem structures or sizes that cause a difficulty for this framework.
To move to a real implementation, the major challenge is a reduction in the computation time. The pure -constraint method for the IP could be compared with a hybrid multi-objective approach, investigating which finds the Pareto frontier sooner. Furthermore, more problem-specific solution methods could improve the solution times for the IP. The simulation optimisation procedure could be naturally parallelised by simulating each design point on a different processor. This would decrease the computation time significantly, as this is the most expensive part of the process. One may also wish to improve what the simulation explicitly models, such as passenger connections and non-linear costs. This would improve the performance estimation and increase the value of the method.