Computationally efficient GPU based NS solver for two dimensional high-speed inviscid and viscous compressible flows

In this study, we proposed a novel GPU-based solution for modelling two-dimensional inviscid and viscous compressible supersonic/hypersonic flows. Texture and surface pointers are used to access GPU memory locations. For effective and efficient use of surface pointers, we grouped multiple 2D arrays referenced and indexed by a single 3D surface pointer. To enable the proposed solver for double-precision calculations, two consecutive 32-bit memory locations were grouped to maintain the efficiency of surface pointers while taking advantage/accuracy of 64-bit calculations. Resolving data and computation dependencies for parallel applications is another complex task that is the focus of this study. Computation dependencies have been solved by using multiple mutually synchronized GPU kernels and executing them sequentially using the GPU default stream to ensure that all relevant data is available to the threads or computed before they actually use it. Consequently, there is no intra-core data dependency in our proposed approach, while inter-core data dependency is successfully solved by stringing multiple kernels together. Using NVIDIA GTX 660 GPUs, we achieved 20x speedup compared to traditional Core i5 $ {\circledR} $ ® computers. This speedup is the result of the Surface Pointer's GPU capabilities for double precision computations. The simulation results are also consistent with the experimental and numerical results of this study.


Introduction
Computational fluid dynamics (CFD) is the study of gas or fluid flow through software modelling of the underlying physics.The Navier-Stokes (NS) equation is the basic equation for CFD, which describes the relationship between pressure, velocity, density, and temperature for fluids in motion (Chandar et al., 2013).Another important factor for fluids moving at high velocity is the occurrence of shock waves in the flow field (Hoffmann & Chiang, 2000).A shock wave is generated when the fluid, gas, or plasma is flowing faster than the speed of sound.When a shock wave is generated, an almost discontinuous change in the temperature, density, and pressure of the fluid is observed.When modelling and simulating fluid flow at higher velocities, it is important to consider such parameters (Hoffmann & Chiang, 2000).
As the geometry on CFD becomes more complex, the size of the computational problem increases significantly CONTACT Muhammad Naveed Akhtar naveed@pieas.edu.pk;Band Shahab shahab@yuntech.edu.tw;Amirhosein Mosavi amir.mosavi@bgk.uni-obuda.hu(Soukov, 2021;Wang et al., 2021).In addition, capturing complex flow features further increases the computational complexity by requiring more grid points to be generated around the domain.These features include shock waves, vortex structures, interactions between the boundary layer and flow separation, etc. Chen et al. (2007), Ladeinde and Nearon (1997), and Nielsen (2004).In addition, the underlying numerical methods require a large number of iterations to converge.CFD Simulations on a modern multi-core computer system can take hours or days (Zhai & Chen, 2003).One way to speed up this process is to perform computations in parallel, which requires a highly scalable parallel computing platform such as a graphics processing unit (GPU).
GPUs are the most compact and massively parallel hardware.Modern GPUs are equipped with thousands of cores operating in parallel and very high memory bandwidth (Lai et al., 2020).GPUs are based on the streaming model of processing and are capable of parallelizing computationally intensive tasks (Glaskowsky, 2009;Weiskopf, 2007).The floating point performance and memory bandwidth of GPUs are several times higher than a conventional CPU (Cuda, 2015).Existing applications that require the computational power of GPUs need to be optimized and rewritten for GPUs using CUDA or OpenCL, etc. Cuda (2015) and Cuda (2020).
Many studies have been conducted to accelerate CFD simulations with GPUs, but most of them considered only incompressible flows.For example, Chandar et al. ( 2013) presented a GPU-based incompressible NS-solver for motion over fixed grids.Wang et al. (2020) proposed memory access patterns for higher order CFD stencil computation.Wang et al. (2014) solved the incompressible 3D NS equation using a combination of CPU and GPU.Thibault and Senocak (2009) simulated incompressible flows of GPU using the NS equation and CUDA.Griebel and Zaspel (2010) simulated an incompressible 3D two-phase flow on multiple GPUs.Tölke and Krafczyk (2008) presented a TeraFLOPs calculation for GPUs, but for the less computationally intensive Lattice Boltzmann method.Zuo and Chen (2010) accelerated the simulation of the NS equation along with the transport equation for airflow in HVAC systems with GPU.Goddeke et al. (2009) have proposed GPU acceleration for FEM NS solver.Rogers and Kwak (1990) used an updraft difference scheme to solve the NS equation accurately in time.Z.H. Ma. et al. have simulated compressible but non-viscus flows on GPU using the meshless method (Ma et al., 2014).Dequan Xu et al. also simulated compressible but non-viscous super-sonic flows on multi-GPU platform (Xu et al., 2021).Zhengyu Tian et al. has used GPUs for simulating supersonic flows on hybrid grids (Tian et al., 2020) and established the accuracy of GPU computation due to its double precision.Other related work could also be found in Adeeb and Ha (2022), Kun and Xiaowen (2022), Kale et al. (2022), Kale, Sharma et al. (2022), Lai et al. (2020), Shao et al. (2022), Wei et al. (2020), andWeng et al. (2021).
To the best of our knowledge, there is no work reporting results from GPU-based NS solvers which consider compressible, viscus and for supersonic/hypersonic flows at the same time.Moreover, there are no reports on using the surface pointer capability of GPUs for such simulations.In this work, we have presented a GPUbased solver for viscus, compressible supersonic and hypersonic flows and its simulation results.We have implemented the modified Runge-Kutta method (Damevin & Hoffmann, 2001) together with the Harten-Yee upwind total variation diminishing (HY-TVD) scheme (Yee, 1989) for shock wave detection on GPU by exploiting the power of surface pointers in an innovative and efficient way.To the best of our knowledge, this is the very first implementation of high-speed, viscous, and compressible flows on GPUs using their surface pointer capability.Main contributions of this study are given below.
(1) We grouped multiple 2D arrays referenced and indexed by a single 3D surface pointer to leverage the efficiency of surface pointers on GPU.(2) We proposed 3D arrays of size X×Y×N, where X and Y are the grid dimensions in a problem and N is the number of 2D arrays packed in the collection, to allow easier and more effective access to GPU memory space.
(3) For double precision calculations, 2 consecutive 32-bit memory locations have been grouped to preserve the efficiency of surface pointers while taking advantage/accuracy of 64-bit calculations.(4) Multiple mutually synchronized GPU kernels are implemented to solve the problem of data and computation dependencies in parallel applications/ computations.
The rest of the paper is organized as follows.Section 2 describes the governing equations for simulation and explains numerical method in detail.Implementation details of our proposed GPU-based NS solver are given in Section 3. Test cases for performance evaluation are detailed in Section 4 and results are discussed in Section 5. Finally, Section 6 concludes this study.

Governing equations
In the present study, the two-dimensional NS equation (Hoffmann & Chiang, 2000), which contains continuity, momentum and energy equations, is solved to simulate viscous compressible flows with high velocity.NS Equation can be expressed in the strong conservation form as shown in Equation ( 1) and described in Equations (2) and (3). Where; and, Here Q is the vector of conservative variables, and E, F show inviscid fluxes, and E v , F v are viscous fluxes.All other variables denote their conventional meanings.For inviscid flow simulation, the viscous fluxes in the right side of Equation ( 1) become zero.The above Equations ( 1) -( 3) are transformed from Cartesian (x, y) to curvilinear coordinates (ξ , η) and can be expressed in flux vector form as shown in Equation ( 4) and described in Equations ( 5) and ( 6) in the following. Where; and, Where J denotes the Jacobian of transformation from (x, y) to (ξ , η) coordinates.Further details of the transformation can be found in Hoffmann and Chiang (2000).
In the present study, air is assumed to be a perfect gas, as shown in Equation ( 7).Therefore; Where, for air,γ = 1.4The absolute viscosity of the air is computed by Sutherland's law, described in Equation ( 8).

Numerical method
Equations described in previous section are hereby solved by using fourth order modified Runge-Kutta scheme (Damevin & Hoffmann, 2001), as shown in Equation ( 9).
Where LQ operator consists of convective part and viscous part and has been discretized by 2nd order central difference approximation.Equation ( 9) is computationally more efficient than the classical Runge-Kutta method, since it is not necessary to store the previous three steps in order to compute the n+1 step.Since this method is a straight-order method, undesirable oscillations may occur in the vicinity of shock waves or other flow gradients.However, these oscillations can be suppressed by adding an artificial damping mechanism.In Yee (1989), the Harten-Yee's formulation, total variation diminishing (TVD), was used as an artificial damping term.There are two different ways to introduce the damping mechanism into the scheme.One way is to add it at each step of the modified Runge-Kutta scheme.
The second approach is to add it in the post-processor phase after computing n + 1 steps.In this study, the latter approach was used because it requires less computational effort and time.The post-processor step with the dissipation part of the TVD scheme can be expressed as shown in Equation (10).
Here G is a limiter which has been computed using Equation ( 16).

Implementation details of proposed GPU-based NS solver
As described earlier, GPUs enable massively parallel implementation of the numerical solution to the Navier-Stokes equation.In this section, we explain our implementation in this regard.A simplified sequence of operations to solve the equations under Section 2 is detailed in Algorithm 1.About 30 two-dimensional arrays of the same size are needed to accommodate all variables.
When multiple blocks are modelled simultaneously with these equations, the number of arrays increases even further.All these arrays had to be accommodated in the GPU's global memory and are updated accordingly.The fastest way to access memory locations on the GPU is through textures and surface pointers when used intelligently (Cuda, 2014).Typically, a surface pointer binds to a single CUDA array, so we need about 30 surface pointers to manipulate all of these arrays.According to Table 13 of Cuda ( 2014), there is an upper limit of 8 surface pointers to global memory for GPUs with processing power 2.x and 16 for GPUs with higher processing power.Clearly, this number is not sufficient to directly exploit the efficiency of surface pointers.If we still want to take advantage of the performance and efficiency of surface pointers, the only way is to use some kind of coordinate transformation technique and show multiple variable arrays with few surface pointers.In this study, multiple 2D arrays are grouped to be referenced by a single 3D surface pointer and indexed accordingly.This technique is explained in Figures 1  and 2. All 2D arrays needed for the problem are packed into a single 3D array and a surface pointer references this common block.The size of the 3D array is given by X×Y×N, where X and Y are the grid dimensions in a problem and N is the number of 2D arrays packed in the collection.CUDA Surfaces use a 32-bit computing architecture by default.To make it suitable for doubleprecision computations, 2 consecutive 32-bit memory locations have been grouped to preserve the efficiency of surface pointers while taking advantage/accuracy of 64-bit computations.
Resolving data and computation dependencies is another complex task for parallel applications.These dependencies have been resolved by using multiple GPU kernels and executing them sequentially using the GPU's default stream.The reason for this is to ensure that all relevant data is available to the threads or computed before they actually use it.Also, a grid point of the geometry and all related computations for that point have been assigned to a single thread.With this approach, no thread has to wait for data computed by a neighbouring thread.In other words, it can be safely said that there is no intra-core data dependency in our proposed approach, while inter-core data dependency is successfully solved by queuing multiple kernels in order of computation to the default stream.The kernels from the standard stream are started in the order of the program and the next kernel starts its operation only when the previous one finishes.In this way, synchronization between kernels is achieved.Also, where necessary, two separate sets of variables were used for source and destination to avoid conflicts when reading and writing memory.
In total, 9 GPU kernels were used to compute different parts of the problem respectively, which are summarized in Table 1.All of these kernels are enqueued into the standard GPU stream, as shown in Figure 3.A 2D block distribution was used to decompose the problem.We   used the NVIDIA C compiler (nvcc) from CUDA toolkit 6.5 to compile all the code.The host operating system was Fedora 20 with a memory of 4 GB.All programs run on an NVIDAI Geforce GTX 660 GPU.

Performance evaluation
The proposed scheme for accelerating the simulation of inviscid and viscous flows was considered for different geometries.The obtained results were also compared with existing experimental and numerical results (see  1. Section 5).The details of the test cases used on GPU are given below.

Case-1: 5 0 wedge with cylindrical leading edge
Steady state, inviscid simulation around a two dimensional wedge having cylindrical leading edge was performed with conditions given in Table 2, found in Prabhu et al. (1989).The grid generated around wedge along with boundary conditions is shown in Figure 4(a) and it is composed of 162 × 101 points.

Case-2: compression corner
The simulation of a steady-state viscous hypersonic flow over a 7.5 0 compression corner (Simeonides et al., 1994) was performed under the conditions given in Table 2.The numerical simulation around the compression corner is described by the leading edge shock, the corner oblique shock, the interaction of the shock boundary layer, and the recirculation region at the corner.To capture these features, a 110 × 55 point mesh was generated around the compression corner.The resolution of the mesh near the wall was kept high to capture the flow gradients and recirculation region, as shown in Figure 4(b).
Flow characteristics were extrapolated at the outflow and at the top of the region.Noslip and adiabatic boundary conditions were established at the wall.

Case-3: 2D half cylinder
A two-dimensional, steady-state, viscous hypersonic flow over a half-cylinder (Prabhu et al., 1989) is simulated with the parameters listed in Table 2.The numerical simulation of the hypersonic flow around such a blunt body is described by a strong detached bow shock and a subsonic flow in the stagnation region.Figure 4(c) shows a grid with 55 × 85 points around the half-cylinder.To calculate the heat flux at the stagnation point, the spacing of the first grid point was set to 1.0 × 10 −6 m.The solution was started under the free flow conditions.At the outflow, the flow variables were extrapolated.At the wall, a slip-free boundary condition was applied.

Case-4: backward facing step
For the conditions listed in Table 2, a steady-state viscous high-speed simulation was performed at the backward 2D stage (Smith, 1967).For this configuration, two mesh blocks were generated, shown in Figure 4(d).The first block consists of 52×52 grid points, while the second block has 81×97 grid points.The supersonic flow field on the backward stage is characterized by an expansion wave at the corner, a separation region, a reattachment, and a re-compression wave.Free flow conditions were specified at the beginning of the solution.The flow characteristics were extrapolated at the outflow and at the top of the two blocks.Slip-free and adiabatic boundary conditions were specified at both walls.All cases described above use the GPU kernel execution order defined in Algorithm 1.The only exception is the backward steps case, where 2 blocks are actually used.In this case, steps 5 through 23 of Algorithm 1 are repeated twice (i.e.once for each block).Another important difference is that the 5 0 wedge geometry is solved using the Euler equation, which does not require calculations of the viscous term.This change is explained in Algorithms 2 and 3 (One stage of the RK4 method).In all cases, Algorithm 2 is used for the calculations in all stages of the RK4 method, while the 5 0 wedge geometry uses Algorithm 3 for all stages of the RK4 method.The value of α used for the modified RK4 method is 1/8, 1/6, 1/4, 1/2 and the value of β is 4, 3, 2, 1 for all 4 stages (Damevin & Hoffmann, 2001).

Results and discussion
As mentioned earlier, a total of 4 different case geometries were simulated using our proposed NS solver on GPU.The structured mesh for each of the geometries is shown in Figure 4.The geometries named compression corner, 5 0 wedge and half cylinder were modelled as single block mesh, while the backward step was modelled with 2 blocks due to the presence of corner in the domain.The simulation results for the density profile and mach number for each case are shown in Figures 5(a,b), 6(a,b), 7(a,b), and 8(a,b).The simulated results for these geometries were also compared with the available experimental or numerical results and good agreement was found, as shown in Figures 9(a,b) and 10(a-d).The time analysis and comparison of all cases on GPU and CPU for 10 iterations is shown in Figure 11.The individual time taken for each kernel is shown in Figure 12.The time comparison for 10k iterations is shown in Figure 13 and the speedup achieved in each case is given in Figure 14.
Figure 11 compares the GPU and CPU time required for 10 iterations, considering all 4 geometries.The GPU total time is the sum of kernel time and memory copy time.The memory copy time is negligible because the memory copy is performed only once before the start of the computation and then the results are copied back to CPU after the completion of the iterations.In all 4 cases, significantly less time was required on the GPU than on CPU.The geometry for the 5 0 wedge contains   the largest number of points to compute, but it too takes less time than the 2D half cylinder which has a smaller number of points to compute.This is because there are no viscous terms that require additional computation time.This difference in time is evident in the next Figure 12.   Figure 12 shows the time taken by each kernel computation for each geometry.The first noticeable point is that most of the time is spent on kernel-1, kernel-2, and kernel-3 calculations.All other kernels and memory copies account for less than 10% of the total computation time in all cases.Kernel-1 is the main routine that  computes all 4 steps of RK4 and the associated variables.Kernel-2 and Kernel-3 are responsible for calculating the total variation diminishing (TVD) variables (TVD(x) and TVD(y)).The calculations performed by all other kernels are explained in Table 1.Another noticeable point is that the computation time trend is similar for all cases, but not for the 5 0 wedge geometry.This is due to the complex and time-consuming calculations for viscous terms, which are not considered in this case.This can be easily found out if you compare the two Algorithms 2 and 3. Algorithm 3 lacks the calculations for viscous terms, which are not required for the 5 0 wedge geometry because it is simulated with the Euler equation.
Figure 13 shows the time required by CPU and GPU for all cases where it takes thousands of iterations to converge the problem to the solution.It is obvious that the problems solved on the GPU are really fast.Backstepping geometry is a comparatively computationally intensive problem and takes about 450 seconds on CPU.Solving the same problem on a GPU takes less than 50 seconds, which is clearly better than solving on CPU.The speedup achieved on the GPU is shown in Figure 14.The maximum speedup achieved on the GPU is about 23, for a compressed corner geometry.The speedup is nearly constant for each number of iterations.

Conclusion and future work
This study presents an efficient GPU based solver for compressible, high speed flows along with simulations  results obtained for four different 2D structured geometries using this solver.Modified RK4 method along with Harten-Yee upwind TVD scheme for shockwave capturing has been implemented for solving the governing equations.The most powerful, compact and massively scalable parallel platform (GPU) was used for these simulations for the very first time.Surface pointers capability of GPU was also exploited to speedup double precision computations.For these problems, results show a speedup between 8 to 22 and more than 20 on an average GPU than that of standard core i5 computing machines.It is evident if CFD codes and software use GPU for simulation of problems, the computation speed can be enhanced.
In the future, this research could be extended to 3D solvers.Moreover, the concept could be extended to implicit solvers.A thorough investigation might be needed to determine the computational cost of these types of solvers.Another research direction could be to study the behaviour of solvers for transient and fuzzy systems.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1 .
Figure 1.Setting up memory read/write through surface pointers.

Figure 2 .
Figure 2. Memory read/write example through surface pointers.

Table 1 .
List of kernels and their utilization in stated cases.

Figure 3 .
Figure 3. Example sequence of kernel execution for case-2 mentioned in Table1.

Figure 4 .
Figure 4. Mesh structure for all 4 problems.(a) Mesh for 5 0 wedge, (b) Mesh for compression corner, (c) Mesh for half cylinder and (d) Mesh for backward facing step.

Figure 6 .
Figure 6.Simulation results for compression corner.(a) Density profile and (b) Mach no.profile.

Figure 7 .
Figure 7. Simulation results for 2D half cylinder.(a) Density profile and (b) Mach no.profile.

Figure 8 .
Figure 8. Simulation results for backward facing step.(a) Density profile and (b) Mach no.profile.

Figure 9 .
Figure 9.Comparison of RK4 + HY-TVD with existing numerical and experimental results for 5 0 wedge.(a) Comparison of pressure distribution along stagnation line for 5 0 wedge and (b) Comparison of temperature distribution along stagnation line for 5 0 wedge.

Figure 10 .
Figure 10.Comparison of RK4 + HY-TVD with existing numerical and experimental results for cases 2 to 4. (a) Pressure coefficient distribution along compression, (b) Skin fraction coefficient distribution along compression corner, (c) Pressure distribution along the wall of 2D backward facing step and (d) Heat flux distribution along 2D half cylinder.

Figure 11 .
Figure 11.CPU and GPU time comparison for all 4 cases.

Figure 12 .
Figure 12.Time taken by individual kernel.

Figure 13 .
Figure 13.Time vs. increasing number of iterations.

Figure 14 .
Figure 14.Speedup vs. increase in number of iteration on GPU.