A reinforcement learning-based routing algorithm for large street networks

Abstract Evacuation planning and emergency routing systems are crucial in saving lives during disasters. Traditional emergency routing systems, despite their best efforts, often struggle to accurately capture the dynamic nature of flood conditions, road closures, and other real-time changes inherent in urban disaster logistics. This paper introduces the ReinforceRouting model, a novel approach to optimizing evacuation routes using reinforcement learning (RL). The model incorporates a unique RL environment that considers multiple criteria, such as traffic conditions, hazardous situations, and the availability of safe routes. The RL agent in this model learns optimal actions through interaction with the environment, receiving feedback in the form of rewards or penalties. The ReinforceRouting model excels in executing prompt and accurate route planning on large road networks, outperforming traditional RL algorithms and shortest-path-based algorithms. The model demonstrates a higher safety score and episode reward than these classical methods. This innovative approach to disaster evacuation planning offers a promising avenue for enhancing the efficiency, safety, and reliability of emergency responses in dynamic urban environments.


Introduction
Changes in the global climate amplify the risk of water-related disasters such as flooding in urban areas. Since 1990, water-related disasters have accounted for 90% of the 1000 most severe disasters (Hendricks et al. 2022). They are the most frequent and expensive natural disasters nationwide, impacting millions of people's lives and thousands of communities every year. Despite the cascading impacts of these extreme events, existing climate adaptation platforms do not fully capture the dynamics of urban community-level response due to the lack of detailed disaster management and efficient evacuation plans at the street level (Gharaibeh et al. 2021). Evacuation and emergency routing systems play a critical role in saving lives and minimizing damage during disaster events. Traditional emergency routing systems usually use remote-sensing images and hydrology methods to simulate the movement of the flood (Henonin et al. 2013, Feng et al. 2015) to obtain flooding information. However, data obtained this way may yield imprecise flood predictions because they fail to consider reshaped surface topography and micro-topographic variations commonly seen in urban environments (Jha et al. 2012, Alizadeh et al. 2021, Kharazi and Behzadan 2021). Accurate and reliable data are crucial for successfully dispatching rescue teams and navigating individuals from flooded areas to safe places during emergencies. Recent research (Cavdur et al. 2016, Yan et al. 2020) showed that the number of deaths is highly related to the efficiency of evacuation plans and routing during emergency events. Therefore, establishing a well-thought-out and timely evacuation strategy is critical in dynamic flooding situations where lives can be at risk (Meyer et al. 2018).
Traditional routing algorithms, such as the capacitated scheduling algorithm (Osman and Ram 2013), the genetic algorithm (Gomes and Straub 2017), and cellular automata-based evacuation (Li et al. 2021a), have been used in the past for approximate solutions in evacuation routing during flood emergencies. However, the following limitations have been identified for existing routing algorithms (Ding et al. 2021). First, traditional routing algorithms are often based on pre-defined rules or heuristics and may not be able to adapt to real-time changes in flood conditions, road closures, or other dynamic factors. For example, in fast-changing flood situations, traditional routing algorithms cannot update routes in real time based on the emergency situation (Duraipandian 2019, Ding et al. 2021). Second, traditional routing algorithms struggle with scalability when dealing with large-scale flood evacuations involving a large number of affected individuals, multiple evacuation routes, and varying capacities of transportation resources. The computational complexity of these algorithms may increase significantly with the size and complexity of the evacuation scenario, leading to longer computation times and potential delays in the evacuation process. Finally, traditional routing algorithms lack the ability to deal with diverse data sources, such as gauge data, information from first responders (e.g. volunteered geographic information such as New York City 311 data), or inputs from affected individuals. Traditional methods rely solely on a few predefined parameters or heuristics, which makes it hard to capture the complexity and dynamics of flood emergencies. To overcome these limitations, innovative approaches such as reinforcement learning techniques can be used.
Recent contributions in deep reinforcement learning (RL) (Sutton and Barto 2018) have shown unique advantages in solving conditional routing problems (Nazari et al. 2018, Levy et al. 2020). In RL frameworks, agents, representing individual people or vehicles in a navigation environment, are simulated to iteratively learn policies that maximize reward feedback through interaction with the environment. This unique trait of RL makes it a natural choice for many data mining problems requiring incremental decisions. For example, RL can learn to solve complex route optimization problems in a dynamic environment by collecting experience, while traditional algorithms focus more on static environments (Qiu et al. 2019). Traditional methods also usually need to recalculate the route when the road network changes, whereas the RL approach can incrementally adapt to an unknown environment without retraining on the entire dataset, reducing computational time (Su et al. 2004). Furthermore, when a route optimization problem becomes a complex sequential decision-making process due to unpredictable natural or artificial causes, solving it with traditional algorithms becomes extremely complex or even NP-hard (Abe et al. 2019).
Although RL has been widely adopted in previous work for pathfinding and route optimization (Godfrey and Powell 2002, Xiong et al. 2017, Wei et al. 2018, Kim and Kim 2021), there are still many limitations in current RL routing optimization approaches. For instance, using RL in large graph networks can be challenging due to inefficient exploration strategies. Here, RL exploration refers to the process of discovering and learning from new states and actions to improve the agent's decision-making. Traditional RL methods may suffer from exploration issues in large graphs, as the search space can be large, leading to long computation times and sub-optimal solutions (Arora et al. 2017, Manchanda et al. 2019). Moreover, the action settings in previous graph routing environments lack consideration of practical usage in the real world. For example, Levy et al. (2020) used compass directions (North, Northeast, East, Southeast, South, Southwest, West, Northwest) as the action space, which ignores the actual road network structure and causes invalid actions: an agent exiting from a controlled-access highway to a freeway keeps roughly the same compass direction while only slightly turning left or right, so no compass action captures the maneuver. Sharma et al. (2021) directly used nodes as the action space for fire evacuation route planning, but this is not applicable in a large network system with tens of thousands of nodes.
To address the challenges mentioned above, we developed a geospatial cyberinfrastructure-enabled reinforcement learning algorithm to improve routing efficiency in a large real-world road network. The algorithm was trained using the National Science Foundation-funded FASTER (Fostering Accelerated Sciences Transformation Education and Research) supercomputer to handle large road networks in a spatial database and support training data-loading tasks (Li and Zhang 2021).
Our RL routing algorithm uniquely considers multiple factors, including safety, reliability, and efficiency, in routing calculations under dynamic and variable weather and flooding conditions. This work makes several significant contributions to the fields of routing and emergency management research:

• We developed a novel graph-based RL environment, complete with efficient action and reward policies, which facilitates sophisticated routing optimization.
• We integrated state-of-the-art reward optimization methods to train our graph-based RL algorithm, even under the complexities of large graph-based environments.
• We successfully scaled up the RL agent to accommodate large graph-based maps (e.g. over one thousand nodes) through the use of behavior cloning optimization methods.
• We incorporated near-real-time flood data in our experimental tests to assess the performance of the graph-based RL algorithm under realistic conditions.
The remainder of this article is organized as follows. Section 2 introduces the literature background of our research. Section 3 presents the data preparation procedures. Section 4 describes our methodology for developing the flooding environment and the details of various optimization techniques. Section 5 implements experiments using the proposed method and analyzes the results. Section 6 discusses this method for different applications. The last section presents conclusions drawn from the study.

Background
Floods are one of the most common hazards in the United States (Perry 2000, Zhang et al. 2014, 2019, Xu et al. 2020, Li et al. 2021b), causing widespread devastation, including the loss of life and damage to personal property and critical public health infrastructure. During a disaster, rapid response and effective evacuation activities are important in minimizing the loss of life or harm to the public (Murray-Tuite and Wolshon 2013, Huang et al. 2016, Yang and Shekhar 2017). Recent studies (Hai-bo and Yu-bo 2017, Qiu et al. 2019, Yin et al. 2019, Li et al. 2021a) found that traffic delays and unknown road conditions are the biggest challenges for efficient rescue. Moreover, in Lim et al. (2013)'s comprehensive review, flood evacuation models with different optimized variables, including travel times, travel costs, unknown traffic, travel distance, and the identification of evacuation routes with an emphasis on flooding situations, are identified as the main difficulties in flood disaster management. Many researchers have focused on modeling flood evacuation as a decision-support system that combines all available information into spatial analysis models to help decision-makers (Liu et al. 2006, Zhang et al. 2016, Lee et al. 2020). For example, Liu et al. (2006) developed an adaptive evacuation route model based on the traditional Dijkstra shortest path algorithm (Dijkstra 1959) to minimize the total evacuation time. Later, Zhang et al. (2016) developed a GIS-based decision support system that can acquire situational information on flood evolution, feasible routes, and high-risk areas for flood detention basins. To evaluate the performance of a flood evacuation model, Li et al. (2019b) simulated flood evacuation with a multi-agent system in a virtual reality environment. In recent studies, Lee et al. (2020) modeled spatial and temporal inundation information with a non-linear auto-regressive model to plan evacuation routes, and Li et al. (2021a) developed an algorithm that couples high-resolution hydrodynamics with cellular automata-based evacuation route planning for flooding situations.
However, these existing routing algorithms suffer from several limitations (Delling et al. 2012, Chen et al. 2014, Zhang, Yang, and Zhao 2016). They are often inefficient in navigating complex and dynamic road networks because they rely on predefined road information, which may not accurately reflect actual road conditions during flooding events (Zhang, Yang, and Zhao 2016). Moreover, Staroverov et al. (2020) found that traditional routing algorithms are not well-suited for real-time updates and are limited in their ability to quickly adapt to changing road conditions. For example, the capacitated scheduling algorithm and the genetic algorithm can only find approximate solutions (Kumari and Geethanjali 2010), and the cellular automata-based evacuation model is limited by the accuracy of the given information (Trindade et al. 2016). Only a few studies have focused on developing routing algorithms that can be applied to real-time changing road networks with little external traffic data. For example, Delling et al. (2012) developed a robust mobile route planning model with limited connectivity information in the road network. Similarly, Mirahadi and McCabe (2021) proposed an evacuation management model that uses Dijkstra's algorithm to dynamically calculate and foresee consequences, and thus create an evacuation decision-support strategy. However, classical routing algorithms often assume that there is only one objective and that the problem's preset environment is completely deterministic. For example, Chen et al. (2014)'s path optimization model for vehicle evacuation uses a greedy methodology that tries out all possible routes in a network, but the model cannot work in a changing environment with unpredictable weather conditions and other social factors. Machine learning (ML) researchers have taken advantage of the increasing computing power of GPUs to adapt supervised neural networks to the pathfinding problem, demonstrating the potential to solve many other location-based problems (Wang et al. 2009, Kumari and Geethanjali 2010, Yin et al. 2023). More recently, alternative methods have been proposed that combine clustering-based methods with evaluation algorithms for vehicular ad hoc networks (VANETs). For instance, building on Bagherlou and Ghaffari (2018)'s routing protocol with simulated annealing and radial basis function (RBF) neural networks, Mohammadnezhad and Ghaffari (2019) developed a reliable routing algorithm using simulated annealing for clustering and RBF neural networks for cluster head selection, showing efficiency in terms of route discovery rate and packet delivery rate. Later, Kheradmand et al. (2022) improved this method's performance using Harris Hawks Optimization (HHO). These methods, although efficient, rely on the specific conditions of the network and the number of generated clusters.
A particular challenge in ML-based path planning models concerns generalization and the selection of training and test data. This challenge raises the questions of (1) whether a trained ML model will still work in a different environment (e.g. a different city) and (2) whether it will work when training and test data are lacking. Deep reinforcement learning (DRL), a type of machine learning technique (Sutton and Barto 2018), has recently been employed in such complex sequential decision-making processes to minimize loss and maximize the long-term gain of an intelligent agent. The choice of DRL in our project was natural because many real-world path planning problems require an incremental decision-making process, and the RL method has no dependency on batch path training datasets (Zhang et al. 2021). Moreover, compared with classical path-finding methods (such as the Dijkstra algorithm and the A* algorithm (Hart et al. 1968)) and other supervised machine learning methods, RL-based route planning can gain experience by interacting with the training environment through a memorized reward function and evaluating the feedback from it, eventually performing as a self-adjusted intelligent decision-support agent for real-world users (Bi et al. 2019). In Chamola et al. (2021)'s survey of machine learning in disaster management, the DRL method is regarded as a self-sustaining system, a unique trait that makes it very promising for supporting disaster management research.
Many researchers have been working on integrating RL techniques to navigate vehicles in various scenarios (Walker et al. 2019, Levy et al. 2020, Sharma et al. 2021, Wang et al. 2021, He et al. 2022). Levy et al. (2020) used DRL to develop the SafeRoute algorithm to help pedestrians safely navigate the city by avoiding street harassment and crime. However, their limited consideration of real-time changing street conditions resulted in no significant improvement over simple avoidance routing. Existing machine learning or reinforcement learning-based routing algorithms cannot navigate large road networks due to scalability issues; they are limited by the size of the data they can process, which affects the accuracy and speed of the routing result (Geng et al. 2021). For example, both Tian and Jiang (2018) and Sharma et al. (2021) have considered using DRL to design evacuation routes for fire disasters, but only in indoor building environments due to scalability issues. DRL methods have also been applied to complex multi-task path-finding scenarios (Wang et al. 2019, 2021): Wang et al. (2019) and Wang et al. (2021) proposed using multi-agent DRL methods to help a group of vehicles design efficient routes in complex environments. Other research (Shi et al. 2023) takes advantage of traffic lights to optimize travel time using DRL. However, no existing studies have provided an efficient solution for handling large-scale, real-world flood evacuation events.
Sample efficiency is another challenge that many researchers have faced when designing a routing algorithm (Kakade 2003, Sohn et al. 2021). Sohn et al. (2021) showed that achieving good learning in Markov Decision Process (MDP) path planning problems requires a large number of samples in RL algorithms. To address this, Pathak et al. (2017) and Seo et al. (2021) proposed using external neural network encoders to embed environment features as intrinsic rewards to optimize the training procedure of an RL agent. Recent studies (Christiano et al. 2017, Kumar et al. 2021) have found that integrating the behavioral cloning method (Schaal 1999) into RL training can improve the policy's performance. In a recent study on human-centered RL, Li et al. (2019a) demonstrated the importance of using human feedback in DRL, which can improve applicability to real-life problems.
In aggregate, Table 1 summarizes the key characteristics of several routing algorithms used in real-world cases. Following the comprehensive survey of routing algorithms by Tyagi et al. (2022), we include the environment setting, algorithm type, completeness, and limitations as comparison attributes. The environment setting can be VANETs; static, where the road network is predefined and unchanging; or dynamic, where the road network can change in real time. The algorithm setting can include traditional methods like Dijkstra's algorithm, GIS-based methods, robust mobile routing, and greedy methods, as well as ML methods such as supervised neural networks (NN), simulated annealing (SA), and deep reinforcement learning (DRL). The completeness of the algorithm refers to whether it provides an exact solution or a heuristic (approximate) solution.

Data collection and processing
This study can be applied to any transportation network in urban or rural areas. Point data with longitude and latitude can be used to represent the Points of Interest (POI) that a user wants to avoid. In our experiment, we used transportation network data collected from New York City and Houston. For the New York City case study, we used New York City 311 data as the POI data, and for the Houston case study, we used gauge data as the POI data.
For the transportation data, the original map information was collected from OpenStreetMap, a free collaborative world map (Bennett 2010). In our experiments (Section 5), we chose to export map information for the downtown areas of Houston and New York City. We used the Houston road network to demonstrate the algorithm's performance, and then used the New York City road network to show that our algorithm can also be applied to other cities in the US with different types of POI data, demonstrating the scalability and flexibility of our algorithm. We first converted the routing map (street network data) into graph-based data using the methods introduced in Section 4.1. For each node, the route map contains:

1. Node ID: the unique ID of each node.
2. Longitude and latitude of the node location.
3. Elevation (meters): the elevation information is collected using OpenTopography (Krishnan et al. 2011).
4. Lowland (Boolean): the lowland attribute of node $n$ is preprocessed from elevation data by comparing $E_n$ against $E_N$, where $E_N$ denotes the elevation set of the neighbors of node $n$ and $E_n$ denotes the elevation of node $n$.

Each edge contains the following information:

1. Edge ID: the unique ID of the edge.
2. Node IDs: the ID pair of nodes that the edge connects.
3. Preset attributes: the preset OpenStreetMap attributes, including the number of lanes, highway type (e.g. tertiary, secondary), one-way (Boolean), bridge (Boolean), road length (meters), speed limit (km/h), and average speed (km/h).
4. Grade (−90 to 90): the slope of the road, calculated from the elevation of its nodes; usually less than 30.
5. Bearing (0 to 360): the bearing direction of the road, calculated from the longitude and latitude of its nodes (e.g. 0 represents north and 90 represents east).
In graph theory, a graph can be represented by an adjacency matrix, whose entries (0 or 1) indicate whether an edge connects a particular pair of nodes. However, mapping a real-world map into an adjacency matrix poses significant difficulties due to the intricate attribute information that the map holds, including street type, street speed limit, elevation, and grade. To manage and update these data, we utilize NetworkX (Hagberg et al. 2008, Rossi and Ahmed 2015). We also incorporate additional data sources, such as the BluPix application and the New York City 311 data, a public dataset recording non-emergency service requests in New York City. For the purpose of street flood analysis, we used a processed version of the New York City 311 data.
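To make the graph schema concrete, the sketch below builds a small attributed NetworkX graph carrying the node and edge fields listed above; the attribute names and values are illustrative assumptions, not the exact identifiers used in our codebase.

```python
import networkx as nx

# A minimal sketch of the attributed road graph described in this section.
G = nx.DiGraph()

# Nodes carry the four attributes listed above (values are illustrative).
G.add_node(101, lon=-95.3698, lat=29.7604, elevation=12.4, lowland=False)
G.add_node(102, lon=-95.3655, lat=29.7631, elevation=9.8, lowland=True)

# Edges carry the preset OpenStreetMap attributes plus derived grade/bearing.
G.add_edge(101, 102, edge_id=5001, lanes=2, highway="tertiary",
           oneway=False, bridge=False, length=510.0,   # meters
           speed_limit=48.0, avg_speed=35.0,           # km/h
           grade=0.5, bearing=54.0)

# The lowland flag can be derived by comparing a node's elevation E_n
# with the elevations E_N of its neighbors.
def is_lowland(G, n):
    neighbor_elevs = [G.nodes[m]["elevation"] for m in nx.all_neighbors(G, n)]
    return bool(neighbor_elevs) and G.nodes[n]["elevation"] < min(neighbor_elevs)
```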

Graph representation in pathfinding problem
A road network can be modeled as a directed graph $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ denotes the set of $n$ vertices in the graph and $E = \{e_1, e_2, \ldots, e_m\}$ denotes the set of roads as $m$ edges. Figure 1 illustrates the comparison between the real-world representation of the routing system and the graph-based routing system. For example, Figure 1(a) shows an example of navigating from point A to point B using Google Maps, while Figure 1(b) represents the graph-based view (Herman et al. 2000).

Markov decision process (MDP)
In RL, the routing system can be modeled as an MDP tuple $M = (S, A, P, R, \rho)$, where $S$ denotes the state set, $A$ denotes the action set, $P$ denotes the transition probability, $R$ is the reward function, and $\rho$ is the initial state distribution (Sutton and Barto 2018). Each state $s$ contains all the information (described in Section 4.2) that the agent observes. The value of a policy $\pi$ is denoted by

$$V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_0 = s\Big], \qquad (2)$$

where $r \in R$ denotes a reward and $\gamma \in [0, 1]$ denotes a discount factor, which weighs the rewards the agent achieves in the present against those in the future. Equation (2) is used to find the optimal policy $\pi^*$, which maximizes the expected return:

$$\pi^* = \arg\max_{\pi} V^{\pi}(s).$$

The optimal policy $\pi^*$ will take the best action from the action space (described in Section 4.2) to lead the agent to reach the goal.
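As a worked illustration of Equation (2), the minimal sketch below computes the discounted return of a finite trajectory; the reward sequence and discount factor are illustrative values only.

```python
# A minimal sketch of the discounted return underlying the policy value V^pi.
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over a finite trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example: sparse rewards, with +1 granted only upon reaching the goal.
rewards = [0.0] * 49 + [1.0]
print(discounted_return(rewards))  # ~0.61: the goal reward discounted by gamma^49
```

This illustrates how the discount factor shrinks the value of a goal reached later along a route, which is exactly the pressure that pushes the agent toward shorter paths.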

States
The state $s \in S$, also known as an observation, represents the agent's current status on the map. In RL applications, the settings of the observation space and action space play an important role in the downstream task. The state of an agent contains the current node and target node, which allows the agent to recognize the state and take an appropriate action to reach the goal. If the agent observes the current node at a later training time with a target node in the opposite direction from before, the agent may take the opposite action to correct its direction. Furthermore, in the early training stage, the agent tends to be more curious, meaning it will try to explore more unseen states by taking as many unknown actions as possible, especially in a large environment (e.g. the Houston metro area). In our model, the states include two parts: the graph observation and the flooding observation.
In a flood scenario, the agent can get an overview of flood depth information from our graph environment. The raw flooding information contains only longitude, latitude, and flood depth (inches). Therefore, we need to convert the flooding information to distances with uniform measures (meters). Given two nodes (the current node $k$ and the flooded node $\phi$), the distance is calculated using Equation (20) (Appendix A). Figure 2(a) shows where flooded points are located in a road network. With fixed latitude and longitude information, the agent can obtain an overview of flooding formation directly from the graph. However, using latitude and longitude coordinates to represent a flooded point does not give the agent any direct information (Levy et al. 2020), and it is challenging to store all reported floods in our observation. Thus, to use node attributes to represent flooding information in the graph observation, we used a graph embedding method (node2vec (Grover and Leskovec 2016)) to embed the flooded nodes. Given the k-nearest reported stop signs, the flooding observation can be expressed as a $2 \times d'$ matrix $s_n^f = (e_n, e_f - e_n)$, where $e_n$ denotes the node embedding of the current node and $e_f$ denotes the sum of the embedding representations of all reported flooded points. In the real-world experiment, our environment setting becomes more complicated in order to better simulate real-world dynamic flooding events, so we introduce a new mechanism called 'evolving' to simulate dynamic flooding (described in Section 5).
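The sketch below assembles the $2 \times d'$ flooding observation described above; `embeddings` stands in for a node2vec lookup table and is a hypothetical structure used only for illustration.

```python
import numpy as np

# A minimal sketch of the flooding observation s_n^f = (e_n, e_f - e_n).
def flooding_observation(embeddings, current_node, flooded_nodes):
    e_n = np.asarray(embeddings[current_node])
    # e_f: sum of the embedding representations of all reported flooded points.
    e_f = (np.sum([np.asarray(embeddings[f]) for f in flooded_nodes], axis=0)
           if flooded_nodes else np.zeros_like(e_n))
    return np.stack([e_n, e_f - e_n])  # the 2 x d' matrix s_n^f
```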

Actions
Actions in the environment represent moving from one street node to another. An agent performs an action and receives an updated state, observing the environment and preparing for the next action. An evacuation agent's action depends on its state and determines its behavior. Figure 2(a) demonstrates how an agent takes an action during training: the agent has three discrete actions (shown as green arrows) in one state. Through a policy network, the policy will eventually converge to an optimal policy that takes the best action to lead the agent to the target node. In our environment, the action space is discrete, following the human decision-making process in wayfinding: 'Turn left,' 'Proceed,' and 'Turn right.' However, not all nodes are located at four-way intersections; some nodes are fork roads, and some are highway exits with two sides. Therefore, the actions need to be masked (described in Section 4.3.2). In our study, the action was formatted as a cardinal number in the training step.

Reward function
The reward function evaluates the action of the agent in each state. In a policy network, the task of the agent is to maximize the reward function. The reward measures whether the agent reaches the target node and whether the agent falls into a flooded situation. In our environment, the agent optimizes various preferences, so the reward function must be designed with several factors. Since the main goal of our evacuation routing problem is to reach the destination safely, we embed the safety factor into the reward as a function of the distance from the current node to the closest flooded points. A list of the reported flood depths is traversed and updated at each step during training. In urban areas, flows greater than 9 cm in depth and 1.5-2 m/s in velocity can cause a loss of stability for subjects weighing 50-60 kg (Russo et al. 2013), and the effective evacuation time is 10 min (Vicario et al. 2020). Even though our evacuation plan is designed for vehicle routing, the safety of passengers is equally important and needs to be taken into consideration. We assumed that any node with a flood depth over 9 cm is a dangerous flooding node and that the flood velocity is constant at 1 m/s. The reward for the flooding factor (Equation (4)) is then a logarithmic function of the distance to danger, where $C \in [-1, 0]$ is an adjustable parameter in the environment setting that limits the negative reward of the flooding factor, $d_f$ is the closest distance between the current node and a dangerous flooding node, and $T$ denotes the effective evacuation distance. In Equation (4), the logarithm controls how quickly the reward curve drops during training. Through this process, the agent learns to weigh the penalty and take a faster road when the distance between the current position and the closest flooded point is acceptable. In addition to the flooding factor, the final goal of reaching the destination also needs to be added to the reward function. Considering the AI safety issue that agents should not game or exploit the reward function to get more reward (Leike et al. 2017), the goal reward should be simple and effective (Chevalier-Boisvert et al. 2018, Sutton and Barto 2018, Levy et al. 2020). Note that the reward function of the flooding factor never generates any positive reward, to avoid reward abuse. Thus, our goal reward (Equation (5)) is simply a positive constant granted only when the agent reaches the destination. Using such a simple reward function in RL routing methods for evacuation scenarios results in a sparse reward environment, where most reward signals are equal to or less than 0.0. Sparse reward environments can pose challenges for RL algorithms, as they may lead to slow learning or difficulty in finding optimal policies due to the lack of informative feedback. In such an environment, agents have to navigate (and change the underlying state of the environment) over long periods of time without receiving much (or any) feedback (Pathak et al. 2017, Moritz et al. 2018). Section 4.3.3 describes the methods for dealing with sparse rewards in our RL flooding evacuation model.
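For illustration, the sketch below combines a logarithmic flood penalty with a sparse goal reward in the spirit of Equations (4) and (5); the exact functional form, the bound $C$, and the effective evacuation distance $T$ are hedged assumptions based on the description above, not the paper's verbatim equations.

```python
import math

def step_reward(d_f, reached_goal, C=-1.0, T=100.0):
    """One plausible reading of Equations (4)-(5): a logarithmic flood
    penalty bounded below by C, plus a simple sparse goal reward.
    The functional form here is our assumption for illustration."""
    if d_f <= 0:
        r_flood = C                         # on a flooded node: maximum penalty
    elif d_f < T:
        r_flood = max(C, math.log(d_f / T)) # negative, never below C
    else:
        r_flood = 0.0                       # far enough from danger: no penalty
    # Goal factor: positive reward only at the destination, so the flood
    # factor can never be gamed for positive reward.
    r_goal = 1.0 if reached_goal else 0.0
    return r_flood + r_goal
```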

Flood evacuation model using reinforcement learning
This section introduces detailed information about the optimization process and adaptation methods. We applied our graph-based RL algorithm to a large-scale road network to support risk-informed decision-making. In our study, agents explored the study area with a flooding event in four stages.

Policy network
MDPs are the basis for an RL framework that models the unknown state probability distribution and transition probability to obtain an optimal policy $\pi^*$ that decides which road to take. Several methods have been developed to find the optimal $\pi^*$ (Watkins and Dayan 1992, Schulman et al. 2017). Recently, the policy-gradient method has shown state-of-the-art improvements in graph RL research and navigation tasks (Schulman et al. 2017, Bøhn et al. 2019, Silva et al. 2020). Policy gradient methods build an estimator of the policy gradient and plug it into a stochastic gradient ascent algorithm that adjusts the weights of the policy towards the maximum rewards. The most common gradient estimator $\hat{g}$ is given by

$$\hat{g} = \hat{\mathbb{E}}_t\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\big], \qquad (6a)$$

where $\pi_\theta$ denotes a stochastic policy, $\hat{A}_t$ is the estimator of the advantage function at time step $t$, the action follows $a_t \sim \pi(a_t \mid s_t)$, and states follow the trajectory for $t \ge 0$. In this case, the expectation $\hat{\mathbb{E}}_t$ is the empirical average over a finite trajectory $\tau$ of our environment samples. Here we define the trajectory $\tau$ as a sequence of states, actions, and rewards:

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots). \qquad (7)$$

$Q(s_t, a_t)$ and $V(s_t)$ in Equation (6a) are the state-action value function and state value function derived from Equation (2) via the MDP along the trajectory, and the advantage $\hat{A}_t = Q(s_t, a_t) - V(s_t)$ can be intuitively taken as the difference between the Q value (future discounted rewards) and the average value of the actions that could have been taken at that state. It measures the extra reward an agent could obtain by taking a particular turn at an intersection. Since $\hat{g}$ is a gradient estimator like other estimators in deep learning models, it is obtained by differentiating an objective (or loss) function:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\big[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\big], \qquad (6b)$$

where $\theta$ is the policy parameter. This policy gradient optimization makes RL frameworks resemble general optimization problems to which deep learning training methods are applicable: training minimizes a loss function for which a parameterized approximator can be established. However, reinforcement learning involves an unknown and non-differentiable dynamic model, which increases the variance of the gradient estimator. Considering these limitations, we use Proximal Policy Optimization (PPO) (Schulman et al. 2017) to optimize the policy. In PPO, the objective function is expressed as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big], \qquad (9)$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the ratio of the probability under the new and old policies and $\epsilon$ is a tunable hyper-parameter (generally 0.1 or 0.2). The clip term $\mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t$ in Equation (9) modifies the surrogate objective (Schulman et al. 2015) by clipping the probability ratio; the minimum of the clipped and unclipped objectives is taken, so the final objective is a lower bound on the unclipped objective (Schulman et al. 2017). In the experiments (Section 5), the agent adopts a well-established learnable estimator, so a tuned neural network is introduced in the next step to train the evacuation model.
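The clipped surrogate in Equation (9) can be written compactly as below; this is a minimal NumPy sketch of the standard PPO loss, not our full tuned training loop.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective of Schulman et al. (2017), Equation (9):
    mean of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    ratio = np.exp(logp_new - logp_old)            # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```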

Action mask
As stated in Section 4.2, existing flood evacuation algorithms lack the capability to deal with complex environments. For example, Sharma et al. (2021) use the nodes in the observation as actions to train an agent to plan an evacuation route in a fire situation. This requires the agent to first learn to memorize the neighborhood nodes and choose a valid action via Q-learning (shown in Figure 2(b)). Figure 2(c) also shows an example of an action space without an action mask. Once the observation extends to a large number of nodes (e.g. over a thousand nodes), the Q-learning method becomes extremely hard to converge (Low et al. 2019). In contrast, Figure 2(d) demonstrates how the action mask works in our flooding evacuation model, where only valid actions are presented to the policy. To remove invalid actions and adapt the policy network to our environment, we designed an action mask method that masks all invalid actions in each state so that they do not affect our policy network. To incorporate this change in actions, the policy network in Section 4.3.1 needs the following modifications: (1) the trajectory $\tau$ in Equation (7) only uses valid actions, and (2) only valid actions are calculated in Equation (6b) during gradient descent.
The policy network first outputs logits (scores without normalization) and then converts them into action probabilities with an activation function. In this case, the MDP has an action set with three directions, $A = \{a_0, a_1, a_2\}$, denoting turn left, proceed, and turn right, respectively. The decision to limit the action space to these three options was based on comprehensive statistics of real-world road networks (Appendix D). Our studies showed that intersections requiring more than these three actions, such as 5-way intersections or situations necessitating a U-turn, constitute a minor percentage of the total intersections in major cities. Moreover, expanding the action space to include these additional actions would significantly increase the complexity of the model. Further, consider a policy $\pi_\theta$ in Equation (6b) parameterized by $\theta = [l_0, l_1, l_2] = [1.0, 1.0, 1.0]$, and assume for ease of presentation that we directly use $\theta$ as the output logits. Then in an initial state $s_0$, with the agent at a start node, we have $\pi_\theta(\cdot \mid s_0) = [1/3, 1/3, 1/3]$. Assume $a_0$ is invalid for state $s_0$; the re-normalized probability distribution $\pi'_\theta(\cdot \mid s_0)$ calculated by an activation function (e.g. softmax) over the masked logits becomes $[0.0, 0.5, 0.5]$. Thus, when an agent traverses a road network, every time it makes a decision at an intersection, only a valid turn will be chosen by the policy network.
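A minimal sketch of the masking step follows: invalid logits are set to negative infinity before the softmax, reproducing the re-normalized distribution from the example above.

```python
import numpy as np

def masked_action_probs(logits, valid_mask):
    """Softmax over logits with invalid actions masked out.
    valid_mask is 1 for valid actions, 0 for invalid ones."""
    masked_logits = np.where(valid_mask, logits, -np.inf)
    exp = np.exp(masked_logits - np.max(masked_logits))
    return exp / exp.sum()

# Example from the text: equal logits [1.0, 1.0, 1.0] for
# (turn left, proceed, turn right); a_0 is invalid in state s_0.
print(masked_action_probs(np.array([1.0, 1.0, 1.0]),
                          np.array([0, 1, 1])))  # -> [0.0, 0.5, 0.5]
```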

Exploration
In reinforcement learning, 'agent exploration' plays an important role in the training process. The existing literature indicates that agents in reinforcement learning algorithms suffer from sparse reward issues (Savinov et al. 2018). For example, in a disaster event the agent cannot receive a reward until it arrives at a destination. One solution to this problem is to let the agent design an intrinsic reward by itself, following steps such as observing, memorizing, and recalling (Savinov et al. 2018). Let the intrinsic reward generated by the agent at time $t$ be $r_t^i$, the original reward be $r_t^e$, and the trade-off parameter between exploration and exploitation be $\beta$. The optimized reward can then be written as

$$r_t = r_t^e + \beta\, r_t^i.$$

In this way, a new objective for the agent can be formulated as maximizing the sum of these two rewards. Pathak et al. (2017) used self-supervised representation learning to encode the observation feature for exploration. Conversely, Seo et al. (2021) encourage exploration without introducing representation learning, instead utilizing a k-nearest neighbor state entropy estimator on a randomly initialized encoder. In our policy network, we adapt these two intrinsic reward methods for our agents.

Reward by auto-encoder representation learning:
The auto-encoder representation learning follows the idea of Pathak et al. (2017). Given a raw observation $s_t$, an auto-encoder neural network (Lange and Riedmiller 2010) is used to encode it into a feature vector $\phi$, where the feature vector stores information about the road situation (e.g. the distance between the goal node and the current node). The auto-encoder consists of two modules: an encoder and a decoder (also called the inverse model and the forward model). The predicted action $\hat{a}_t$ for the action $a_t$ taken by the agent from state $s_t$ can be defined as

$$\hat{a}_t = g_e\big(\phi(s_t; \theta_e), \phi(s_{t+1}; \theta_e); \theta_i\big),$$

where $\theta_i$ and $\theta_e$ are the learnable parameters in the encoder neural network $g_e$ and are trained to minimize the discrepancy between the original action and the predicted action. In the decoder $g_d$, $a_t$ and $\phi(s_t; \theta_e)$ are used to predict the feature encoding of $s_{t+1}$:

$$\hat{\phi}(s_{t+1}) = f\big(\phi(s_t; \theta_e), a_t; \theta_d\big),$$

where $f$ is trained to minimize the regression loss between the predicted estimate $\hat{\phi}(s_{t+1})$ and the actual $\phi(s_{t+1})$. Finally, the intrinsic reward is computed as the scaled forward-model prediction error

$$r_t^i = \frac{\eta}{2}\,\big\|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\big\|_2^2,$$

where $\eta > 0$ is a scaling factor. In the experiment section (Section 5), we introduce our tuned auto-encoder network for intrinsic reward agents.
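The following sketch outlines a curiosity module in the style of Pathak et al. (2017); the feature dimension, layer sizes, and scaling factor `eta` are illustrative assumptions rather than our tuned settings (those are reported in Section 5).

```python
import torch
import torch.nn as nn

class CuriosityModule(nn.Module):
    """A minimal sketch of the encoder/inverse/forward trio described above."""
    def __init__(self, obs_dim, n_actions, feat_dim=64, eta=0.5):
        super().__init__()
        self.eta = eta
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        # Inverse model g_e: predicts a_t from (phi(s_t), phi(s_{t+1})).
        self.inverse = nn.Linear(2 * feat_dim, n_actions)
        # Forward model f: predicts phi(s_{t+1}) from phi(s_t) and a_t.
        self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)

    def predicted_action_logits(self, s_t, s_next):
        # The classification loss on these logits is what trains the encoder.
        return self.inverse(torch.cat([self.encoder(s_t),
                                       self.encoder(s_next)], dim=-1))

    def intrinsic_reward(self, s_t, s_next, a_onehot):
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        phi_pred = self.forward_model(torch.cat([phi_t, a_onehot], dim=-1))
        # r_t^i: scaled squared error of the forward-model prediction.
        return self.eta / 2 * (phi_pred - phi_next).pow(2).sum(dim=-1)
```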

Reward by random-encoder:
The random-encoder (Seo et al. 2021) is based on a k-nearest neighbor entropy estimator (Singh et al. 2003). It relies on a state entropy computed from the distance of each state to its k-nearest neighbor states in the replay buffer, measured in a representation space. Let $X \in \mathbb{R}^q$ be a random variable with probability density $p$. The random-encoder starts from the differential entropy (Michalowicz et al. 2013) to create a high-dimensional feature (or representation) space for monitoring the observations of a moving agent:

$$H(X) = -\mathbb{E}_{x \sim p(x)}\big[\log p(x)\big].$$

Since our observation of the road network is low-dimensional, we employed a particle-based k-nearest neighbors (k-NN) entropy estimator over samples of $X$ as follows:

$$\hat{H}(X) \propto \frac{1}{N}\sum_{i=1}^{N} \log \big\| x_i - x_i^{k\text{-NN}} \big\|_2, \qquad (15)$$

where $x_i^{k\text{-NN}}$ denotes the k-NN of $x_i$ within the set $\{x_i\}_{i=1}^{N}$ (Seo et al. 2021). The difference between states in the low-dimensional feature space of a randomly initialized encoder can be calculated using this distance measure. Utilizing Equation (15) and treating each transition as a particle (Liu and Abbeel 2021), the intrinsic reward of the random-encoder can be expressed as

$$r_t^i = \log\big(\big\| y_i - y_i^{k\text{-NN}} \big\|_2 + 1\big),$$

where $y_i$ is a fixed representation output from the random-encoder and $y_i^{k\text{-NN}}$ is the k-nearest neighbor of $y_i$. Measuring the difference between states in this feature space enables a more stable intrinsic reward, as the pair of states does not change during training (Seo et al. 2021). A comparison of these two exploration methods is presented in Section 5.
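A minimal sketch of the random-encoder reward follows; `Y` holds fixed representations from a randomly initialized encoder, and the implementation is a simplified reading of Seo et al. (2021).

```python
import numpy as np

def re3_intrinsic_rewards(Y, k=3):
    """Random-encoder intrinsic reward: log(||y_i - y_i^{k-NN}|| + 1),
    where Y is an (N, d) array of fixed random-encoder representations."""
    dists = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)  # (N, N)
    dists.sort(axis=1)
    knn_dist = dists[:, k]   # column 0 is each point's distance to itself (0.0)
    return np.log(knn_dist + 1.0)
```

Because the encoder is never trained, the pairwise geometry of the representation space stays fixed, which is what makes this intrinsic reward stable over the course of training.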

Scaling optimization
Navigating large road networks often leads to sparse rewards, making it challenging for agents to reach their goals. The design of scaling optimization helps the agent remain robust in real-world applications (Li et al. 2019a). Inspired by human imitation behavior, researchers employ imitation learning (IL) (Schaal 1999) to construct an agent that learns from historical trajectories. This approach circumvents the need to learn from sparse rewards or to manually specify a reward function, and the resulting agent can be designed for longer horizons or multi-task goals in industrial applications (Magzhan and Jani 2013).
In this section, we present a scaling optimization design that imitates expert decisions to support agent navigation in a road network comprising over 80k nodes. The behavior cloning optimization trains an agent using a sequence of decisions from human experts $\hat{\tau}_i \in \{\hat{\tau}_1, \hat{\tau}_2, \ldots, \hat{\tau}_m\}$, wherein each decision comprises an expert trajectory of states, actions, and rewards. For each trajectory $\tau$ in the dataset $\hat{\tau}$, the scaling optimization estimates the advantage function $\hat{A}^\pi(s_t, a_t) = (R_t - V_\theta(s_t))/c$ for time $t = 1, \ldots, T$. This function measures the relative quality of an action at a given state compared to the average action at that state under the policy. Here, $V_\theta(s_t)$ represents the value function discussed in Section 4.1, and $R_t$ represents the modified reward function computed from $z$, a path representation containing a sequence of node IDs at timestamp $t$; the term $z_i^e$ denotes an expert path decision. In the experiment settings (Section 5), we utilized a random sampling method to summarize the path feature, and the path dimension was adjusted according to statistical results. We used the implementation of the monotonic advantage reweighted imitation learning (MARWIL) method (Wang et al. 2018) in Ray RLlib (Liang et al. 2017) for behavior cloning optimization, maximizing the advantage-weighted log-likelihood of expert actions with hyperparameter $\beta$:

$$\hat{\mathbb{E}}_{(s,a) \sim \hat{\tau}}\Big[\exp\big(\beta\, \hat{A}^\pi(s, a)\big)\, \log \pi_\theta(a \mid s)\Big].$$

In traditional pathfinding problems, the shortest path can be approximated using greedy or heuristic algorithms. For our experiment, we utilized trajectories generated by a multicriteria decision-making routing algorithm under flood emergency conditions (Alizadeh et al. 2022) that we had previously collected. When the environment becomes static, i.e. the flood ceases to evolve and avoidance areas are no longer a concern, the routing algorithm simplifies to a general shortest path finding problem. In such circumstances, the decisions made by these algorithms can be considered expert decisions.
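As a simplified reading of the MARWIL objective above (not RLlib's exact implementation), the advantage-weighted behavior cloning loss can be sketched as follows.

```python
import numpy as np

def marwil_loss(logp_expert_actions, advantages, beta=1.0):
    """Behavior cloning weighted by exp(beta * A_hat): expert decisions with
    higher estimated advantage contribute more to the policy update.
    With beta = 0 this reduces to plain behavior cloning."""
    weights = np.exp(beta * advantages)
    return -np.mean(weights * logp_expert_actions)  # minimized during training
```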

Flooding environments
We can obtain flooding information in urban areas from various information sources. Using the lowland attribute added in Section 3, if the weather in the flooding environment is rainfall or the statistical precipitation level is high, we assume that lowland intersections have a higher chance of becoming potential flooded points given the existing flooding information. In a real-world environment, flooding conditions in urban areas are more uncertain due to weather factors. Thus, we added an 'evolving' mechanism to simulate this situation. The 'evolving' setting can be configured as a parameter of the flooding environment. If it is enabled, we randomly choose intersections labeled as lowland (excluding the origin point and goal point) as new flooding points with average flood depth. For example, if 'evolving' is configured as 10, a new flooded point is generated every 10 steps in an episode. At the start of the next episode, the environment is reset to clean up all generated flooded points, keeping only the stop signs and gauge flooding information.
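A minimal sketch of the 'evolving' mechanism is shown below; class and attribute names are illustrative, and the average flood depth is a placeholder value.

```python
import random

class EvolvingFloodEnv:
    """Sketch: spawn a new flooded point at a lowland intersection
    every `evolving` steps; reset clears the generated points."""
    def __init__(self, lowland_nodes, origin, goal, evolving=10, avg_depth=12.0):
        self.candidates = [n for n in lowland_nodes if n not in (origin, goal)]
        self.evolving = evolving      # spawn period in environment steps
        self.avg_depth = avg_depth    # average flood depth (inches)
        self.flooded = {}             # node -> flood depth
        self.step_count = 0

    def step(self):
        self.step_count += 1
        if self.evolving and self.step_count % self.evolving == 0:
            node = random.choice(self.candidates)
            self.flooded[node] = self.avg_depth

    def reset(self, base_flooded):
        # Keep only the persistent stop-sign and gauge flooding information.
        self.flooded = dict(base_flooded)
        self.step_count = 0
```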

Experiments and training
Our research was demonstrated in two distinct study areas: Houston and New York City. In the case of Houston, flood data were sourced from the period of Hurricane Harvey (August 17, 2017 to September 3, 2017). The New York City environment was selected to demonstrate the scalability of our algorithm. Comprehensive information regarding our experiments is provided in Appendix B. Figure 3 illustrates the workflow of how our policy neural network interacts with the flooding environment. Within the policy neural network, we utilize an exploration encoder, comprising a representation encoder and a random encoder, to aid agents in identifying the destination point within a road network. The observation space, depicted on the left side, consists of various types of information, including neighborhood graph embedding, flooding graph embedding, target node, current node, path information, and action mask. The action design, detailed in Section 4.3.2, is characterized by three directional movements: right (R), left (L), and forward (F). The reinforcement learning simulation process is depicted in Figure 4. In our configuration, the graph functions as the environment, with the simulation process running continuously until convergence of the policy network is achieved. The graph environment provides a reward as feedback to the policy network for parameter optimization, based on the actions taken by the agent. The simulation episode concludes when the agent reaches the destination point or exceeds the maximum horizon (episode length), at which point the environment returns a terminal signal. As demonstrated in Figure 5, the overall simulation comprises approximately 2 million episodes, with each episode containing around 100 steps. This robust simulation process ensures a comprehensive exploration of the environment and the effectiveness of our policy network.
In the training step, we followed the definition of the pathfinding algorithm (Hart et al. 1968): an agent starts from an arbitrary node and moves along the edges between nodes as a path until it reaches the destination node. Since hyper-parameter tuning varies from experiment to experiment, we used Tune (Liaw et al. 2018) to adjust the neural network structure and all other configurations. For example, in the Houston experiment, the tuned neural network structure for the random-encoder method is 256 × 256 × 256 fully-connected hidden layers with ReLU activation functions (Agarap 2018). In the auto-encoder experiment, the inverse model and forward model were both set to 256 × 256, and the hidden layers were tuned to 256 × 256 × 256 with ReLU activations. The PPO policy network was tuned with a 128-unit fully-connected hidden layer for action masking and fully-connected hidden layers with dimensions 256 × 512 × 256. The learning rate for the small and large environments was set to 0.0003 and 0.0005, respectively. In general, the tunable reward factors $a$ and $b$ can be set to 0.5. As mentioned in the literature (Pathak et al. 2017), the reward factor can also be decayed by a linear or exponential function. During training on the large experiment, we increased the scaling optimization weight $v$ up to 0.5 following a linear schedule to help the agent follow the scaling optimization (Figure 6 shows the result of evacuation routing in a large flooding environment with 0.5-weighted scaling optimization). We reset the environment in every training epoch to randomly sample a different node pair as the start node and destination node for generalization purposes. At the beginning of training (for the first 100 epochs), we manually controlled the distance between the start node and destination node to let the agent learn the reference route faster (Goecks et al. 2019). In addition, we developed a data loader to process the multi-dimensional data collected from each observation; in the data loader, the different states in an observation are flattened into 1 × N dimensions. We developed a high-performance geospatial cyberinfrastructure to optimize computational efficiency (Li and Zhang 2021). For example, we used a Ray cluster (Liang et al. 2017) to balance the workload of RL training. Figure 7 illustrates the general framework of the geospatial cyberinfrastructure used in this project. We used four nodes to balance the training workloads, and each node has a 64-core CPU with an NVIDIA Ampere A30 Tensor Core GPU to accelerate computing. The master node and control node manage all computing resources in our cluster and maintain the weights of the policy network. Each worker node is a replica of the training unit, and it clones the flooding environment to perform distributed training.
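For reference, the tuned settings reported above can be consolidated into a single configuration view; this plain dictionary simply mirrors the reported values (the actual runs were configured through Ray Tune).

```python
# Consolidated view of the tuned hyper-parameters reported in this section.
config = {
    "random_encoder_hiddens": [256, 256, 256],    # fully-connected, ReLU
    "auto_encoder": {"inverse": [256, 256], "forward": [256, 256],
                     "hiddens": [256, 256, 256]},
    "ppo_policy": {"action_mask_hidden": 128, "hiddens": [256, 512, 256]},
    "lr": {"small_env": 3e-4, "large_env": 5e-4},
    "reward_factors": {"a": 0.5, "b": 0.5},       # tunable trade-off factors
    "scaling_weight_max": 0.5,                    # increased linearly during training
}
```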

Evaluation and results
We evaluated our model in several experiments. Because our task focuses on safe evacuation paths during a flooding event, the quality of the paths is measured using a safety metric based on $\mathrm{dist}(i, F)$, where $n$ is the number of nodes along the path, $i$ is the index of the nodes, and the function dist calculates the closest distance between the current node $i$ and the flooded node-set $F$. Figure 8 illustrates our recorded results from 16,000 route samplings. The blue line depicts the safety test results obtained using a traditional Dijkstra routing algorithm, while the red line represents the results from the scaling optimization. Both 'MA-' lines have been smoothed using an exponentially weighted moving average with a window of 1000 for clearer visualization. In a real flood event, a distance dropping to zero signifies a vehicle being inundated. The experimental results indicate that the traditional evacuation routing algorithm more frequently traverses flooded points, thereby potentially compromising passenger safety. Although our model occasionally approaches flooded points (e.g. around the 14,000th sample), it predominantly maintains an evacuation route that is safely distanced from flooding. This proximity to flood points is a result of the model balancing the effects of the reward discount ($\gamma$) and the safety factor. Importantly, our method does not directly traverse flooded points, suggesting that it surpasses traditional algorithms in terms of safety evaluation. In reinforcement learning, the model's performance is evaluated using various metrics, including the episode reward and the episode length. The episode reward, accumulated over a sequence of states, actions, and rewards ending with a terminal state, reflects the model's ability to collect rewards within a specific environment and RL setting. Meanwhile, the episode length, indicating the number of actions an agent takes in an episode, can shed light on the agent's exploration capabilities and strategy utilization.
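Returning to the safety metric defined above, a minimal sketch follows; reading the metric as the minimum distance along the path (consistent with 'a distance dropping to zero signifies a vehicle being inundated') is our assumption.

```python
def safety_score(path_nodes, flooded_nodes, dist):
    """Closest approach of the evacuation path to the flooded node-set F.
    `dist(i, F)` returns the closest distance from node i to F; a score of
    zero means the route passed through a flooded point."""
    return min(dist(i, flooded_nodes) for i in path_nodes)
```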
In our study, the episode length also represents the number of interactions the agent has with the environment.We employed distributed training techniques, scaling up to 4000 replications per iteration across three working nodes.We recorded the episode rewards and episode length after 400,000 trained environments for a small set of environments and after 1,600,000 trained environments for a larger set, using these as evaluation metrics.
Figure 5 presents the evaluation results across different training environments, with the y-axis indicating the returned episode reward and length and the x-axis representing training steps. An exponentially weighted moving average was used to smooth the line chart for enhanced visualization. Figure 5(a) demonstrates the training evaluation on Houston's small map (a rendered demonstration is available online). The detailed evaluation scores are also listed in Appendix C. The results indicate that reinforcement learning models with intrinsic reward augmentation outperform the vanilla version of proximal policy optimization in terms of episode rewards. The auto-encoder optimization converged more slowly than the random-encoder; however, the auto-encoder representation method achieved better final performance at the conclusion of training because the random-encoder takes a different exploration strategy during training. Figure 5(b) shows that the episode length of the random-encoder is much higher than that of the auto-encoder method. This means the random-encoder recommends an evacuation route with more detours to avoid flooded areas, thereby giving relatively poor consideration to the 'shortest' factor.
Therefore, we used the auto-encoder proximal policy optimization (PPO) model as our base model to generate optimal evacuation routes. To better analyze the actions taken by different optimization methods, we sampled actions in a small test environment (Figure 9(a)). When an agent takes an action during training (green arrows are the three available actions the agent can choose), we record each action and plot its distribution. Figure 9(b) shows the comparison of the auto-encoder representation method and the random-encoder method. We found that the red dots (random-encoder) did not show good convergence for action 3 (Figure 9(b)) late in the training process, which means that the random-encoder had a much longer exploration period. This is also in line with the previously mentioned longer episode length in the evaluation of the random-encoder. Based on the auto-encoder, we added scaling optimization for large-environment training. Figure 5(c) shows the scaling optimization (green line) with the behavior cloning method planning the evacuation route in a large city road network. The results show that PPO methods with auto-encoder optimization alone can only produce good results when the agent is far away from the submerged points, and they are barely able to reach the destination point. Figure 5(d) also shows that the episode length of the scaling optimization method remains within a reasonable range.
In Figure 5, the episode curves for the large and small Houston maps differ significantly. This divergence is primarily attributed to the exploration-exploitation trade-off and environmental non-stationarity. Initially, the agent's exploration strategy identifies rewarding action sequences, leading to a surge in episode reward and length. However, as the agent refines its policy amidst continued exploration, temporary setbacks may occur, explaining the subsequent drop in reward and length. This phenomenon is particularly pronounced in large, complex environments, where simplistic models may struggle to capture the full environmental complexity, resulting in suboptimal performance. The non-stationarity of the environment, as detailed in Section 5.1, also plays a crucial role. Our model's evolving mechanism, simulating the uncertain nature of urban flooding, introduces new flooding points during episodes in large road networks, increasing environmental complexity and unpredictability. This contrasts with small road networks, where shorter episode horizons limit the evolving mechanism's impact. On the small road network, the agent rapidly learns an optimal policy due to the environment's relative simplicity and stability. Conversely, on the large road network, the agent requires more time to adapt to the complexity and variability introduced by the evolving mechanism, explaining the initial reward and length increase, subsequent drop, and ultimate convergence.
We conducted simulations to investigate how our policy neural network interacted with the flooding environment using data collected during the Hurricane Harvey period from August 17, 2017 to September 3, 2017. The results of this simulation are presented in Figure 6. In the presence of changing flood conditions, the map information in traditional navigation software becomes distorted. However, our agent can improve the reference path even without this information by combining street information within the near-visible range and external flood information to develop an optimized evacuation route. Our approach generated an evacuation route for both the training and test environments (Figure 6). The results indicate that our approach can efficiently plan a secure route during a flood event with limited information on the actual road network.
Our proposed framework worked well with various datasets. For example, Wu (2021) used New York City 311 data (NYCOpenData 2022) to create a near-real-time flooding map to help people understand the impact of Hurricane Ida (Figure 10).

Discussion
The reinforcement learning method estimates each state-action value to optimize disaster evacuation strategies. More route strategies and spatial patterns can be found and investigated by choosing different reward factors beyond the proposed optimization techniques. For example, we can use road type attributes as a reward factor to change the road preference of the evacuation agent. Figure 11(a) shows the regular evacuation plan in a small environment without road type preferences. By rewarding the route with a road-type factor $r[\text{road}] = \frac{L_p}{L}$ (where $L_p$ denotes the length traveled on the 'primary' road type and $L$ denotes the total length from start point to destination), the agent will choose a different route to complete the task (shown in Figure 11(b)).
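A minimal sketch of this road-type factor, with illustrative edge attribute names, is shown below.

```python
def road_type_reward(path_edges, preferred="primary"):
    """r[road] = L_p / L: the fraction of the route length traveled
    on the preferred road type. Edge attribute names are illustrative."""
    total = sum(e["length"] for e in path_edges)
    preferred_len = sum(e["length"] for e in path_edges
                        if e["highway"] == preferred)
    return preferred_len / total if total > 0 else 0.0
```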
Compared with other heuristic algorithms, reinforcement learning shows good adaptability, so developers can easily add multiple tasks without redesigning the policy optimization method (Cai et al. 2019). For example, emergency managers often need to add stops during evacuation planning (e.g., picking up a child from school). By adding a bonus point in the road network that returns an extra reward to the agent, we can fine-tune the model to plan a multi-stop route in a dynamic environment (shown in Figure 11(c)), as sketched in the code below.
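Both adaptations reduce to reward shaping. A minimal sketch of the two terms, the road-type factor $r_{\text{road}} = L_p / L$ and a per-stop bonus, follows; the edge attribute names and the bonus magnitude are illustrative assumptions.

```python
def road_type_reward(route_edges: list, preferred: str = "primary") -> float:
    """r_road = L_p / L: the fraction of the route length on the
    preferred road type (Figure 11(b)). Edge attribute names are assumed."""
    total = sum(e["length"] for e in route_edges)
    preferred_len = sum(e["length"] for e in route_edges
                        if e.get("highway") == preferred)
    return preferred_len / total if total > 0 else 0.0

def stop_bonus(node, remaining_stops: set, bonus: float = 5.0) -> float:
    """Extra reward the first time the agent reaches a required stop,
    encouraging multi-stop routes (Figure 11(c))."""
    if node in remaining_stops:
        remaining_stops.discard(node)
        return bonus
    return 0.0
```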
As shown in Figure 8, our method is not significantly affected when attempting to create a 'safe' route versus a 'short' route. The reinforcement learning model seeks to make an informed decision that leverages the different conditions to create a better route. However, a 'do not get flooded' bottom line still restricts the agent's behavior. This restriction helps developers and users understand the model's strict characteristics for future development and usage.
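This bottom line can be read as a hard terminal constraint layered over the soft safety-versus-length trade-off. A minimal sketch under that assumption (the penalty magnitude and function shape are illustrative):

```python
FLOOD_PENALTY = -100.0  # illustrative magnitude, assumed to dominate other rewards

def step_outcome(next_node, flooded: set, goal, base_reward: float):
    """Soft trade-offs (safety vs. distance) live in base_reward, but
    entering a flooded node terminates the episode with a large penalty."""
    if next_node in flooded:
        return FLOOD_PENALTY, True      # (reward, episode done)
    return base_reward, next_node == goal
```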

Future works
Although our analytical experiments demonstrate good results in route planning, testing on real flooding events is needed to validate our algorithms. Future research will focus on a field experiment with actual weather and road conditions to put the theoretical work into practice. This will also provide an opportunity to test the potential inclusion of more complex actions in the action space, further enhancing the practicality and versatility of our model. Extending this approach to broader impacts, our RL routing algorithm can also be adapted for future self-driving systems and even exoplanet rovers. Like agents in disaster events, exoplanet rovers lack external guidance. With limited online information (GPS or other satellite imagery), reinforcement learning agents can be trained to navigate the rover to its destination.
Moreover, the inherent flexibility of reinforcement learning makes it possible to incorporate more complex actions into the action space, including 5-way intersections and U-turns. These actions can enrich the algorithm's capacity and enable the agent to navigate more complicated road networks. On the other hand, these new features will increase the algorithm's computational complexity in terms of convergence and exploration, which should be considered at the implementation stage. As part of our future endeavors, we plan to explore additional training and optimization techniques, such as the use of a pre-trained routing policy (Wu et al. 2023). We anticipate that such improvements could considerably broaden the applicability of our model to a diverse array of road situations.
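As a sketch of how such an extension might look, the enum below enlarges the discrete action set and masks invalid turns by intersection degree; this is an illustrative design under stated assumptions, not our current action space.

```python
from enum import IntEnum

class Action(IntEnum):
    # Relative turn choices at an intersection, ordered clockwise.
    SHARP_LEFT = 0
    LEFT = 1
    STRAIGHT = 2
    RIGHT = 3
    SHARP_RIGHT = 4  # only reachable at 5-way intersections
    U_TURN = 5       # reverse along the incoming edge

def valid_actions(out_degree: int) -> list:
    """Mask the turn set by intersection out-degree (a simplification);
    a larger action set raises exploration cost, as noted above."""
    turns = [Action.SHARP_LEFT, Action.LEFT, Action.STRAIGHT,
             Action.RIGHT, Action.SHARP_RIGHT][:out_degree]
    return turns + [Action.U_TURN]
```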

Conclusion
In this study, we developed an RL-based routing algorithm to help people navigate in urban areas during flooding events under complex, information-limited, and dynamic road network conditions. Our graph-based RL model, equipped with real-world action design and reward settings, has demonstrated practicality and feasibility in navigating real-world scenarios with multiple flood information sources. The key findings and implications of our study are:
• Our model effectively learns spatial information around flooding areas, producing high-quality evacuation routes even in the face of changing road conditions (e.g., road closures caused by flooding) and limited neighborhood information.
• The integration of reinforcement learning into the routing algorithm represents a significant intellectual contribution to GIScience, enabling navigation in complex and dynamic large road networks.
• The practical implications of our research extend to the field of disaster management, providing safe evacuation routes during flooding events by embedding routing information during the training processes.
• The versatility of our method allows for adaptation to other disaster events such as earthquakes or volcanic eruptions, making it a valuable tool for urban navigation.

Figure 1. A demonstration of a real-world road map and graph road network.

Figure 2. A comparison of different action designs in RL navigation research. The yellow nodes are the start and target points, and the blue node is a flooded point. This sub-graph is sampled near (29.7685519, -95.3772329) in downtown Houston.

Figure 3. A demonstration of our policy neural network interacting with a flooding environment. The expected reward integrates the intrinsic reward, the extrinsic reward, and a scaling optimization.

Figure 4. A demonstration of the reinforcement learning simulation process.

Figure 5. A comparison of the results of different optimization methods in our experiments. The y-axis shows the returned episode reward and returned episode length, and the x-axis shows the training steps. We use an exponentially weighted moving average with a 500-step window to smooth the line charts for better visualization; smoothed lines are marked with the prefix 'MA-'. 'RE' is the random-encoder method, 'AE' is the auto-encoder representation learning method, 'VAN' is the vanilla version of proximal policy optimization, and 'SC' is the scaling optimization method.

Figure 6. A demonstration of our policy neural network interacting with the flooding environment.

Figure 7. An illustration of the high-performance cyberinfrastructure architecture used to implement the reinforcement learning routing algorithm. We used four nodes to balance the training workloads; each node has a 64-core CPU and an NVIDIA Ampere A30 Tensor Core GPU to accelerate computing.

Figure 8. A demonstration of routing evaluation on 16,000 randomly sampled Houston environments. The red line is the smoothed moving average of our method, and the blue line is the smoothed moving average of the traditional method. The y-axis is the distance (meters) to the closest flooded point, and the x-axis is the number of sampled routes.

Figure 9. The exploration and exploitation comparison of the auto-encoder representation learning method (AE) and the random-encoder method (RE). In this small environment (shown in the left figure), the best action is 'turn right', numbered as action '3'.

Figure 10. An illustration of using New York City 311 data to build a flooding environment for testing our reinforcement learning routing algorithm. (a) 311 calls made during the week following Hurricane Ida (blue) and the week following Hurricane Henri (red), from Wu (2021). (b) The results of using New York City 311 data to test the adaptability of our proposed framework. The yellow dot and green dot represent the starting point and destination point, respectively.

Figure 11. Reinforcement learning evacuation planning model adaptations. The yellow dots represent the two endpoints of a route. The blue dots represent flooding points in the environment. (a) Using road type 'all' to set the road preference. (b) Using road type 'primary' to set the road preference. (c) Multi-stop route planning in the Houston small road network.

Table 1. Comparison of real-world routing algorithms.