Parallel performance assessment of a pseudo-spectral solver for transition and turbulent boundary layer flows

The computational performance and accuracy of a parallel pseudo-spectral solver are assessed, and the solver is validated for direct numerical simulations (DNS) of bypass transition and turbulent boundary layer flows. The solver shows a spatial order of accuracy of 6.8 and 3.6 for laminar and turbulent flow simulations, respectively, and scalability up to 16 thousand processors. DNS predictions for plane channel and flat-plate bypass transition flows compare very well with the benchmark results available in the literature. Analysis of the turbulent structures shows counter-rotating longitudinal structures in the sub/buffer layers, which induce lifting on each other to generate a series of sweep and ejection events. These events generate shear stress, which transfers energy from the mean flow to the streamwise turbulent fluctuations. The energy is then redistributed to the other turbulent velocity fluctuations via pressure-strain. Temporal DNS of bypass transition flow compares very well with spatial DNS at an order of magnitude lower computational cost, and is thus identified as a viable alternative for transition flow physics simulations.


Introduction
Engineering applications of computational fluid dynamics (CFD) require robust turbulence and transition models (Borse, Bhushan, Walters, & Burgreen, 2018; Fu & Jin, 2018; Liu, Chen, Zhou, & Zhang, 2018). Advances in high performance computing (HPC) are enabling the use of direct numerical simulation (DNS) as a tool for transition/turbulence flow physics analysis and modeling. For example, Zhang, Watanabe, and Nagata (2018) studied turbulence entrainment across turbulent/non-turbulent interfaces for a compressible boundary layer. The study identified that the mass flux from the viscous sublayer to the turbulent sublayer occurs in the direction normal to the turbulent/non-turbulent interface, while a tangential transfer dominates the mass flux in the turbulent sublayer. Xia, Kun, and Fan (2018) investigated heat transfer in a turbulent boundary layer including the effects of particles and forced convective conditions. Results demonstrated that buoyancy enhances the wall-normal turbulent velocity fluctuations and subsequently the heat transfer. Surface roughness due to particles destroys the longitudinal streaks in the sublayer and enhances turbulence, while stabilizing temperature fluctuations. Gruber, Richardson, Aditya, and Chen (2018) investigated flame propagation in a turbulent boundary layer under the effect of fuel-oxidant mixture stratification. The study concluded that for a homogeneous fuel-oxidant mixture, the near-wall flame front deflects the flow away from the wall, whereas for a stratified mixture, the flow is accelerated near the wall, resulting in a stronger flame-turbulence interaction. Masih, Bhushan, and Barros (2018) investigated energy transfer between synoptic and mesoscales in the atmospheric boundary layer. The study identified that a mesoscale feedback is required to explain the dual inertial ranges observed in the atmospheric boundary layer. Wu et al. (2017) investigated the turbulent spot inception mechanism in incompressible bypass transition flow over a flat-plate boundary layer under zero pressure gradient. They observed that turbulent spot inception occurs as spanwise vortex filaments fold to generate vortices and then hairpin packets, analogous to the growth of secondary instabilities in natural transition.
This study focuses on DNS of incompressible boundary layer flows to assist in the development of transition/turbulence models (Bhushan, Walters, Muthu, & Pasiliao, 2018). Traditionally, DNS of incompressible boundary layer flows has focused on fully developed turbulent channel flows (Alfonsi, 2011). Early on, pseudo-spectral solvers were the solvers of choice, as they allow for coarser grid resolution than their finite-difference or finite-volume counterparts. These solvers use fast Fourier transforms (FFT) along the periodic streamwise and spanwise directions and Chebyshev polynomials along the wall-normal direction (Kim, Moin, & Moser, 1987; Moser, Kim, & Mansour, 1999). Further, a periodic boundary condition along the streamwise direction was justified, as the boundary layer growth ceases when the boundary layer thickness reaches the half channel height. However, pseudo-spectral solvers pose challenges for domain decomposition on HPC systems, as FFT (and inverse FFT) operations require all the data along the FFT direction (Friedrich, Hüttl, Manhart, & Wagner, 2001). Lee, Ulerich, Malaya, and Moser (2014) pointed out that spectral methods require expensive all-to-all communication, but the reduced grid size requirements compensate for the increased communication costs.
Some recent studies have retained the FFT schemes along the streamwise and spanwise directions but have replaced the Chebyshev polynomial in the wall-normal direction. For example, Hoyas and Jiménez (2006) replaced the Chebyshev polynomial in their solver by a fourth-order finite-difference scheme. The solver with the Chebyshev polynomial did not include domain decomposition along the wall-normal direction, which limited its application to DNS of channel flow up to Re τ = u τ H/ν ≤ 950 (Del Alamo, Jiménez, Zandonade, & Moser, 2004), where u τ is the friction velocity, H is the half channel height, and ν is the kinematic viscosity. In contrast, the solver with the finite-difference scheme allowed domain decomposition and enabled DNS of channel flow up to Re τ ≤ 4200 (Lozano-Durán & Jiménez, 2014). Lee and Moser (2015) replaced the Chebyshev polynomial by a seventh-order spline scheme for high-Re flows (channel flow up to Re τ ≤ 5200).
Some studies have opted for purely finite-difference schemes. Abe, Kawamura, and Matsuo (2001) and Iwamoto, Kasagi, and Suzuki (2005) used a fourth-order scheme along the periodic streamwise and spanwise directions and a second-order scheme in the wall-normal direction; the solver was applied for DNS of channel flow up to Re τ ≤ 2003. Bernardini, Pirozzoli, and Orlandi (2014) used a second-order scheme in all directions, and the solver was applied for DNS of channel flow up to Re τ ≤ 4100.
DNS of bypass transition and/or turbulent flat-plate boundary layer flows have primarily adopted the spatial approach using either finite-difference or finite-volume schemes. Jacobs and Durbin (2001) used a second-order central-difference scheme and Zaki and Durbin (2005) used a second-order finite-volume scheme for Re θ ≤ 525, where θ is the momentum thickness. Wu and Moin (2009) used a second-order central-difference scheme for DNS of bypass transition flow up to Re θ ≤ 940, and Wu et al. (2017) extended the study to Re θ ≤ 3000. Jiménez, Hoyas, Simens, and Mizuno (2010) modified the channel flow solver of Hoyas and Jiménez (2006) (discussed above), wherein a fourth-order finite-difference scheme replaced the FFT scheme in the streamwise direction, for turbulent flat-plate simulations up to Re θ ≤ 2000. Borrell, Sillero, and Jiménez (2013) applied the solver for turbulent flat-plate simulations up to Re θ ≤ 6650. Schlatter, Brandt, de Lange, and Henningson (2008) used a pseudo-spectral solver for DNS of bypass transition flow. The study used a spatial approach wherein an outer fringe region mapped the streamwise outflow to the inflow, as required for a periodic domain. Bobke, Örlü, and Schlatter (2016) recently used the same pseudo-spectral solver (as used by Schlatter et al., 2008) for the temporal simulation of flat-plate turbulent boundary layer flow with suction. Temporal simulations do not require mapping of the outflow to the inflow, as the simulation domain effectively moves along the streamwise direction (i.e. along the plate) owing to the periodicity of the flow. Kozul, Chung, and Monty (2016) also used the temporal approach for simulation of the growth of a fully developed turbulent flat-plate boundary layer induced by a moving plate. The study concluded that the temporal approach is currently under-exploited in the literature and may prove useful for understanding turbulent boundary layers. Zhang et al. (2018) used the temporal approach to study turbulence entrainment across turbulent/non-turbulent interfaces in a compressible boundary layer.
Overall, a survey of DNS studies of incompressible transitional and turbulent boundary layer flows shows that the simulation strategy is evolving to accommodate higher-Re flows on advanced HPC systems. Researchers have adopted either of two paths. Some have opted for finite-difference methods and larger grids in lieu of spectral methods, probably due to the HPC challenges associated with spectral methods; in particular, only limited studies have performed domain decomposition along the Chebyshev polynomial direction. Others have used spectral methods for temporal simulations, which significantly reduce grid size requirements compared to the spatial approach, since FFT (i.e. periodicity) along the streamwise direction remains applicable.
The overarching objective of the current research is to validate a parallel pseudo-spectral solver for transition onset flow physics analysis using a temporal DNS approach. This study builds on a precursor study by Bhushan and Walters (2014) and focuses on: (a) computational performance improvement of the solver; (b) assessment of the numerical accuracy of the solver; (c) validation of the solver for DNS of plane channel flows by comparing mean and turbulent flow predictions against the Moser et al. (1999) DNS (M-DNS); and (d) preliminary validation of temporal DNS for bypass transition flow over a flat plate against spatial DNS (JD-SDNS) (Jacobs & Durbin, 2001) and experimental data (Roach & Brierly, 1990).

Numerical method and simulation summary
The study uses the pseudo-spectral incompressible Navier-Stokes (N-S) equation solver, ParSpectra, originally developed by Bhushan and Warsi (2005). The governing equations are discretized using Fourier series in the periodic streamwise (x) and spanwise (z) directions and Chebyshev polynomials in the wall-normal direction (y), and are solved using a three-step fractional step method (Butcher, 2000). Bhushan and Walters (2014) improved the HPC performance of the solver by implementing MPI domain decomposition along the wall-normal (Chebyshev polynomial) direction using the influence matrix method (Peyret, 2001), and by using OpenMP thread parallelization for the FFT calculations. The solver was limited to modest scalability up to hundreds of processors. The primary focus of that study was to validate the applicability of the influence matrix method for turbulent flow predictions. The study performed large eddy simulations (LES) of plane channel flow at Re τ = 590 and 2003 using the dynamic Smagorinsky model (DSM) (Lilly, 1992), and also demonstrated the applicability of temporal simulations for flat-plate boundary layer transition. The following HPC improvements are performed in this study: (1) MPI domain decomposition along the streamwise FFT direction following Takahashi (2009), so that the updated version of the solver involves 2D MPI domain decomposition along the x and y directions; and (2) OpenMP thread parallelization for simultaneous calculations over the wall-normal planes within each MPI domain.

Table 1 summarizes all the simulations performed in this study, as discussed below. Temporal and spatial convergence of the solver is studied for simple 2D laminar flows following Brown, Cortez, and Minion (2001), Liu, Ren, and Zhang (2004), and Minion and Brown (1997). The test cases include: (1) simulation of a traveling wave in a doubly periodic domain; and simulation of thick shear-layer vortex roll-up in (2) a 2D doubly periodic domain and (3) a streamwise periodic domain. DNS of plane channel flow at Re τ = 180 and 590 is performed to: (4) evaluate the effect of aliasing error on the simulations and determine an optimal de-aliasing approach for DNS; (5) evaluate the spatial accuracy of the solver for turbulent flow simulations; (6) study the scalability of the solver; and (7) validate the predictive capability of the solver. Lastly, (8) preliminary validation is performed for temporal DNS of bypass transition flow.

The traveling wave simulations start from the exact solution of a decaying traveling wave at t = 0 and are performed on a doubly periodic FFT-FFT plane in a unit square domain x × z = [0,1] × [0,1] with ν = 0.02. The thick shear-layer roll-up simulations are performed using an initial jet flow condition (Equation 2), where u is the velocity along the streamwise x direction and v is the velocity perpendicular to the jet along the ξ direction, with ρ = 30, β = 0.05 and Re = 10,000. Simulations are performed in both FFT-FFT (F-F) and FFT-Chebyshev (F-C) planes. The F-F plane simulations are performed in a unit square doubly periodic domain, where ξ = y in Equation 2(a). For the F-C plane simulations, a periodic boundary condition is used in the x direction, and a zero wall-normal gradient is applied at the y boundaries. The Gauss-Lobatto collocation points (η) are stretched to concentrate the grid points at the center of the domain using a mapping function with L = 1.15 and β = 0.9. The channel simulations are performed in the same domains as M-DNS for Re τ = 180 and 590; for both cases H = 1. Periodic boundary conditions are used along the x and z directions, and a no-slip boundary condition is applied at y = −1 and 1.
The wall-normal grid distribution is the same as the Gauss-Lobatto point distribution, as it already provides fine grid resolution at η = ±1. A favorable streamwise pressure gradient dp/dx = −u τ ² is prescribed to balance the momentum loss due to the wall shear stress τ w = u τ ². The flat-plate simulations are performed in a cubic domain, where L is the domain length, using periodic boundary conditions along the x and z directions, a no-slip condition at the bottom wall (y = 0) plane, and a Neumann boundary condition at the top (y = L) plane, where u is the streamwise velocity, v is the wall-normal velocity, and w is the spanwise velocity. Simulations are performed for five different domain lengths varying from L = 20δ 0 to 50δ 0 for Tu in = 3.5%, where δ 0 is the initial boundary layer thickness. The Gauss-Lobatto points are mapped onto the vertical domain [0, L] and stretched to concentrate the points close to the bottom wall y = 0 using a stretching function. The details of the grid and temporal resolutions used in the simulations are provided in the respective sections below. The scalability study was performed on Titan at the Oak Ridge Leadership Computing Facility (OLCF), a Cray XK6 machine with 16 central processing units (CPUs) per node and 2GB RAM per CPU. The channel and flat-plate simulations were performed on Talon at the Mississippi State University HPC² facility, an IBM P6 machine with 12 CPUs per node and 2GB RAM per CPU, using 48-96 processors.
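As a concrete illustration of the wall-normal discretization described above, the following Python sketch generates Chebyshev-Gauss-Lobatto collocation points and maps them onto the flat-plate vertical domain [0, L] with clustering near the wall. The sinh-based stretching and its parameter are illustrative assumptions, since the solver's exact mapping function is not reproduced here.

```python
import numpy as np

def gauss_lobatto(n):
    """Chebyshev-Gauss-Lobatto collocation points eta_j = cos(pi*j/n), j = 0..n."""
    return np.cos(np.pi * np.arange(n + 1) / n)

def map_to_wall_clustered(eta, length, beta=2.0):
    """Map eta in [-1, 1] to y in [0, length], clustering points near y = 0.
    The sinh stretching used here is illustrative only; the solver's actual
    mapping function and parameters are not reproduced in the text."""
    s = 0.5 * (1.0 - eta)                       # s in [0, 1]; s = 0 at the wall (eta = 1)
    y = length * np.sinh(beta * s) / np.sinh(beta)  # dy/ds is smallest at s = 0, i.e. near-wall clustering
    return y

if __name__ == "__main__":
    eta = gauss_lobatto(64)
    y = map_to_wall_clustered(eta, length=20.0)
    print("first few near-wall points:", np.sort(y)[:5])
```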

Decaying traveling wave in a doubly periodic domain
The exact solution for the traveling wave consists of staggered u, w and p peaks and troughs, which move through the flow domain and complete one cycle in t = 1 s, as shown in Figure 1(a). The L1 norm prediction errors at simulation time t = 0.7, computed using the analytic solution, show second-order temporal accuracy in Figure 2(a). Simulations with different spatial resolutions but the same Δt show less than 0.1% difference in the error. This is expected, as the FFT has exact spatial accuracy for this case (Minion & Brown, 1997). Further, as expected, MPI domain decomposition does not show a significant effect on the solutions.
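For reference, the decaying traveling-wave solution commonly used in the cited references (Minion & Brown, 1997; Brown et al., 2001) and the L1-error evaluation can be sketched as follows. The functional form below is an assumed reconstruction, since the exact expression is not reproduced above; it is written here in the (x, z) plane with ν = 0.02.

```python
import numpy as np

def traveling_wave(x, z, t, nu=0.02):
    """Decaying traveling-wave solution commonly used as a convergence test
    (Minion & Brown, 1997); assumed form, not copied from the paper."""
    decay = np.exp(-8.0 * np.pi**2 * nu * t)
    u = 1.0 + 2.0 * np.cos(2.0 * np.pi * (x - t)) * np.sin(2.0 * np.pi * (z - t)) * decay
    w = 1.0 - 2.0 * np.sin(2.0 * np.pi * (x - t)) * np.cos(2.0 * np.pi * (z - t)) * decay
    p = -(np.cos(4.0 * np.pi * (x - t)) + np.cos(4.0 * np.pi * (z - t))) * decay**2
    return u, w, p

def l1_error(field, exact):
    """L1 norm of the prediction error, as used for the convergence study."""
    return np.mean(np.abs(field - exact))

if __name__ == "__main__":
    n = 64
    x, z = np.meshgrid(np.linspace(0.0, 1.0, n, endpoint=False),
                       np.linspace(0.0, 1.0, n, endpoint=False), indexing="ij")
    u0, w0, p0 = traveling_wave(x, z, 0.0)   # initial condition at t = 0
    u7, w7, p7 = traveling_wave(x, z, 0.7)   # reference solution at t = 0.7
    print("L1(u) if the solver returned the exact field:", l1_error(u7, u7))
```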

Thick shear-layer flow
The thick shear-layer initial condition results in two shear layers of opposite vorticity at ξ = 0.25 and 0.75, which evolve into vortex roll-ups as shown in Figure 1(b). Simulations are performed using Δt = 2.5 × 10⁻⁴ on: (a) five grids varying from 64 × 64 to 256 × 256 in the F-F plane, and (b) four grids varying from 96 × 97 to 256 × 257 in the F-C plane, to evaluate the spatial accuracy of the solver. (c) Simulations are also performed on a 256 × 257 grid in the F-C plane using 2, 4, and 8 domain decompositions along the Chebyshev polynomial direction to evaluate errors due to domain decomposition. In the absence of an exact solution for this case, a fine grid solution is used as the reference solution for error estimation. Minion and Brown (1997) concluded that a 2D F-F plane simulation on a 768 × 768 grid is a reasonable representation of the exact solution. Thus, additional simulations were performed on a 768 × 768 grid in the F-F plane and a 768 × 769 grid in the F-C plane for error estimation on the coarser grids.
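A minimal sketch of the thick shear-layer ('jet') initial condition with the quoted parameters (ρ = 30, β = 0.05) is given below. The tanh/sine form is the one used in the cited references (Minion & Brown, 1997) and is an assumed reconstruction of Equation 2, not copied from the paper.

```python
import numpy as np

def shear_layer_ic(x, xi, rho=30.0, beta=0.05):
    """Thick double shear-layer ('jet') initial condition with the parameters
    quoted in the text (rho = 30, beta = 0.05). The tanh/sine form below is
    the one used in the cited references and is an assumed reconstruction."""
    u = np.where(xi <= 0.5,
                 np.tanh(rho * (xi - 0.25)),
                 np.tanh(rho * (0.75 - xi)))
    v = beta * np.sin(2.0 * np.pi * x)        # small perturbation normal to the jet
    return u, v

if __name__ == "__main__":
    n = 128
    x, xi = np.meshgrid(np.linspace(0.0, 1.0, n, endpoint=False),
                        np.linspace(0.0, 1.0, n, endpoint=False), indexing="ij")
    u, v = shear_layer_ic(x, xi)
    print("u range:", u.min(), u.max(), " v amplitude:", np.abs(v).max())
```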
The convergence study focuses on L1 norm errors for velocity and vorticity predictions at t = 1 s.
As shown in Figure 2(b), the spatial order of accuracy of the F-F solver is ∼8 for grid resolutions of 64² to 96², and ∼11.5 for the finer grids. The accuracy of the solver is consistent with that of the Minion and Brown (1997) F-F solver. The F-C solver shows an order of accuracy of 6.8 when the grid is refined from 96 × 97 to 256 × 257. The F-C solver shows a lower order of accuracy than the F-F solver, as the former does not allow a systematic grid refinement throughout the domain in the way the latter does. The domain decomposition along the Chebyshev polynomial direction shows < 4% variation in the errors, which is somewhat large compared to that obtained for the FFT domain decomposition. This is expected, because the grid distribution along the Chebyshev polynomial direction changes with domain decomposition but remains unchanged along the FFT direction. The peak divergence in the simulations was obtained at the Chebyshev polynomial domain interfaces and was 2.5 × 10⁻⁶|ω max |, where |ω max | is the maximum vorticity magnitude in the simulation. The peak values were about 5 times higher than the peak divergence obtained for single-processor simulations. However, the divergence levels are deemed acceptable, as they are still negligibly small compared to the maximum vorticity in the simulation.

Evaluation of aliasing error
The non-linear convective term in the Navier-Stokes equation involves multiplication of the velocity and its gradients. Multiplication in physical space corresponds to convolution in spectral space, and its calculation up to the Kth wavenumber mode depends on the convolution of wavenumber modes as large as 2K. If the individual terms extend only up to K wavenumber modes, the Fourier amplitudes of the convolution are aliased. To have an exact representation up to the Kth mode, the individual terms should be expanded up to M = 2K modes. Canuto, Hussaini, Quarteroni, and Zang (1987) reported that expansion up to M = (3/2)K modes is sufficient for the first alias-affected wavenumber to lie just outside the 'useful' range, i.e. at K + 1. This is referred to as the 3/2-rule, wherein an extra K/2 modes are added and initialized to zero. This is a well-accepted de-aliasing approach in the CFD community (Chow & Moin, 2003).
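A minimal one-dimensional sketch of the 3/2-rule is given below: the input spectra are zero-padded to 3/2 of the original length, the product is formed in physical space on the finer grid, and the result is transformed back and truncated. This illustrates the rule itself rather than the solver's 3D implementation.

```python
import numpy as np

def dealiased_product(u_hat, v_hat):
    """De-aliased product of two 1D Fourier series using the 3/2-rule.
    u_hat, v_hat are full complex spectra of length N (numpy.fft ordering).
    Returns the spectrum of u*v truncated back to N modes."""
    n = u_hat.size
    m = 3 * n // 2                               # padded size (3/2-rule)
    k = n // 2

    def pad(a_hat):
        a_pad = np.zeros(m, dtype=complex)
        a_pad[:k] = a_hat[:k]                    # non-negative wavenumbers
        a_pad[-k:] = a_hat[-k:]                  # negative wavenumbers
        return a_pad

    u_fine = np.fft.ifft(pad(u_hat)) * (m / n)   # rescale for numpy's 1/N convention
    v_fine = np.fft.ifft(pad(v_hat)) * (m / n)
    w_pad = np.fft.fft(u_fine * v_fine)          # product formed in physical space
    w_hat = np.zeros(n, dtype=complex)
    w_hat[:k] = w_pad[:k] * (n / m)              # truncate and undo the rescaling
    w_hat[-k:] = w_pad[-k:] * (n / m)
    return w_hat

if __name__ == "__main__":
    n = 32
    x = 2.0 * np.pi * np.arange(n) / n
    u, v = np.sin(3.0 * x), np.cos(5.0 * x)
    w = np.fft.ifft(dealiased_product(np.fft.fft(u), np.fft.fft(v))).real
    print(np.max(np.abs(w - u * v)))             # ~1e-15: this product is fully resolved
```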
The effect of aliasing error is investigated for DNS of plane channel flow at Re τ = 180 and 590. Simulations are performed using a 128 × 129 × 128 grid for Re τ = 180 and a 256 × 257 × 256 grid for Re τ = 590. The convergence of τ w is used for the assessment of the aliasing error. As shown in Figure 2(c), simulations without de-aliasing show lower τ w than expected; the under-predictions are around 9% and 30% for Re τ = 180 and 590, respectively. Simulations with 9/8 zero-padding show a 5% over-prediction for both Re τ , whereas both 5/4 and 3/2 zero-padding show < 1% error. The results suggest that 5/4 zero-padding may be used to reduce computational cost. However, the aliasing error increases with Re, so this conclusion may not hold at higher Re. The simulations presented in this study use 3/2 zero-padding.

Spatial accuracy for turbulent flow simulations
The spatial accuracy of the solver for turbulent flow simulations is assessed using the grid verification approach used in engineering applications (Stern, Wilson, & Shao, 2006). In this approach, it is assumed that the numerical order of accuracy (P g ) can be determined from the asymptotic convergence of the results, i.e. the prediction errors are expected to decay with grid refinement at a rate proportional to Δ^Pg, where Δ is the grid spacing. Herein, LES of channel flow is performed using DSM on five different grids ranging from 64 × 65 × 64 to 144 × 145 × 144, such that the grid refinement ratio relative to the coarsest grid spacing Δ C varies from 1 to 2.25. All simulations use the same time-step size Δt = 0.005U c /H, where U c is the centerline streamwise velocity. Note that a larger time-step size could be used for the coarser grids, but a constant time step is used to avoid contamination of the grid uncertainty estimates by time-step uncertainty. The time-step size corresponds to Δt + = Δtu τ ²/ν = 0.14, which is lower than the Δt + = 0.4 recommended for DNS (Choi & Moin, 1994). Thus, the time-step uncertainties are expected to be negligible compared to the grid uncertainties. However, the errors could be contaminated by coupled grid-modeling errors, as LES models explicitly depend on the grid resolution. As shown in Figure 3, the spatial order of accuracy of the solver is ∼3.6 for turbulent flow simulations, which is almost half of that obtained for the simplified 2D laminar flows above.
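For reference, the observed order of accuracy can be estimated from solutions on three systematically refined grids in the spirit of the verification approach of Stern et al. (2006), as sketched below; the numerical values are placeholders, not the study's data.

```python
import numpy as np

def observed_order(s_coarse, s_medium, s_fine, r):
    """Observed order of accuracy P_g from solutions on three grids with a
    constant refinement ratio r (Richardson-extrapolation-type estimate used
    in grid-verification studies such as Stern et al., 2006)."""
    eps_21 = s_medium - s_fine      # change from fine to medium grid
    eps_32 = s_coarse - s_medium    # change from medium to coarse grid
    return np.log(eps_32 / eps_21) / np.log(r)

if __name__ == "__main__":
    # Placeholder values of an integral quantity (e.g. wall shear stress)
    # on coarse, medium and fine grids with refinement ratio r = 1.5.
    print(observed_order(s_coarse=1.045, s_medium=1.010, s_fine=1.000, r=1.5))
```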

Parallel performance
ParSpectra parallel performance was evaluated on five different grids ranging from 96 × 289 × 96 to 1024 × 767 × 1024 cells using the solver CPU time for ten time-step calculations. Solver profiling shows that 55%, 15% and 30% of the CPU time is spent in the convection marching, pressure correction, and viscous solver steps, respectively. The convection marching is the most expensive step, as the non-linear term calculations require several forward and inverse FFT operations.
Strong scalability helps assess the parallel performance of the solver and determine the least number of grid points per processor required for effective utilization of the computational resources. Ideal scalability is achieved when the solver speeds up linearly with an increase in the number of processors. However, as the number of processors increases, so does the CPU time spent on MPI communication, resulting in a degradation of the performance (Bhushan, Carrica, Yang, & Stern, 2011). As shown in Figure 4(a), the scalability improves with increasing grid size, because the time spent on MPI communication relative to the solution decreases. The solver shows reasonable scalability up to 8K processors on the finest grid consisting of 800M grid points, where the performance is 30% lower than ideal. For all the grids, the scalability performance is within 30% of ideal for 100K grid points per processor.
As shown in Figure 4(b), the MPI domain decomposition along the Chebyshev polynomial direction shows super-linear scalability. The super-linear scalability is due to significant speed-up in the pressure correction and viscous solver steps, which show N² and N^(3/2) scalability, respectively, where N is the number of processors. The convection step, on the other hand, shows linear scalability. The improved scalability of the pressure correction and viscous solver steps is due to the speed-up of the Helmholtz solver: the size of the LHS matrix decreases as N², and consequently the computational cost of the matrix diagonalization method decreases super-linearly.
Strong scalability of the solver due to OpenMP thread parallelization is studied on three different grids consisting of 2M, 4M and 8M cells. As shown in Figure 4(c), thread parallelization shows peak performance on six threads. Scalability improves from 60% of ideal on the coarsest grid to 86% of ideal on the finest grid. The poorer scalability on coarser grids and larger thread counts is probably due to dominant OpenMP overhead. The computation time decreases by only 12% when the memory available to a single thread is increased from 2GB to 24GB; thus, memory bandwidth does not play a significant role in the OpenMP speed-up.
A weak scalability study helps determine the maximum number of grid points per processor that should be used for simulations without significant loss of computational performance. Ideal scalability of unity (1) is achieved when the CPU time per iteration remains the same despite the increase in the number of processors (and grid size). Better scalability is expected for a lower number of grid points per processor, as such cases are less susceptible to memory bandwidth or cache sharing issues (Dongarra, Gannon, Fox, & Kennedy, 2007). Performance degradation is expected with increasing numbers of processors (and grid size) as the overall MPI communication time increases. The scalability study is performed using 200, 400 and 800K grid points per processor, as shown in Figure 4(d). The scalability decreases by a reasonable 25% from ideal for ≤ 400K grid points per processor up to 4000 processors, whereas simulations using 800K grid points per processor show significant performance degradation for 1000 processors or more.
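The strong- and weak-scaling metrics discussed above follow directly from wall-clock timings, as sketched below with placeholder numbers (not the study's measurements).

```python
def strong_scaling(times, procs):
    """Strong-scaling speed-up and parallel efficiency relative to the
    smallest processor count (fixed total problem size)."""
    t0, p0 = times[0], procs[0]
    speedup = [t0 / t for t in times]
    efficiency = [s * p0 / p for s, p in zip(speedup, procs)]
    return speedup, efficiency

def weak_scaling(times):
    """Weak-scaling efficiency: ideal value is 1 when the time per step stays
    constant as processors and grid size grow together."""
    return [times[0] / t for t in times]

if __name__ == "__main__":
    # Placeholder timings (seconds per ten time steps), not the study's data.
    procs = [128, 256, 512, 1024]
    times = [100.0, 52.0, 28.0, 16.0]
    print(strong_scaling(times, procs))
    print(weak_scaling([40.0, 41.0, 44.0, 52.0]))
```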
Overall, the solver shows good scalability up to 8K processors for grids consisting of 800M points. Some recent studies have demonstrated scalability to even larger numbers of processors. For example, the mixed finite-difference (along the x- and y-directions) and FFT (along the z-direction) solver parallelized using hybrid 2D MPI domain decomposition (along x and y) and OpenMP showed reasonable linear scaling up to 33K processors (Borrell et al., 2013). The mixed spline (along the y-direction) and FFT (along the x- and z-directions) solver parallelized using hybrid 2D MPI domain decomposition (along the x- and y-directions) and OpenMP showed good scaling up to 786K processors (Lee et al., 2014). However, note that Borrell et al. (2013) and Lee et al. (2014) used much larger grid sizes, around 75 times (60 billion points) and 1000 times (700 billion points) larger, respectively, than those used herein. ParSpectra uses 2D MPI and OpenMP thread parallelization similar to the above studies, and its scalability is reasonable considering the smaller grid size. However, one notable difference is that ParSpectra uses domain decomposition along the Chebyshev polynomial direction. Since the solver is intended for temporal boundary layer simulations, the largest grid is expected to be around 584 × 385 × 384. The x domain can be partitioned into a maximum of 24 sub-domains, to ensure that the number of sub-domains remains less than the number of collocation points per sub-domain. In the previous study, Bhushan and Walters (2014) demonstrated that the influence matrix method requires a minimum of 64 collocation points along the Chebyshev polynomial direction; thus, for such grids a maximum of six sub-domains can be used in the y direction. Further, six OpenMP threads have been identified as optimal for HPC systems with 2GB per-processor memory. Thus, it is anticipated that the largest simulations will be performed on up to 900 processors. This is on par with other applications in the literature, e.g. the transition simulations by Wu et al. (2017) and the turbulent simulations by Bobke et al. (2016), both of which used 1024 processors.

DNS of turbulent channel flow
DNS for Re τ = 180 is performed using a 128 × 129 × 128 grid. The simulation domain and grid used for this case are the same as in M-DNS, and the grid resolution is such that Δx + ∼ 18 and Δz + ∼ 6. The simulation is performed using Δt = 10⁻³ U c /H for 50K iterations, and time averaging is performed over the last 20K iterations. DNS for Re τ = 590 is performed using 256 × 257 × 256 (G2) and 384 × 257 × 256 (G1) grids. The domain used for this case is the same as in M-DNS, but there are some differences in the grid resolution. M-DNS used 384 × 257 × 384 points such that Δx + ∼ 10 and Δz + ∼ 5. Both grids G1 and G2 have the same resolution along the y-direction as M-DNS; G1 and G2 have Δx + ∼ 10 and 14.5, respectively; and both have Δz + ∼ 9. Panton (2001) recommended that DNS should use Δz + ∼ 5-7 and that Δx + can be larger than Δz + . A literature review of the available DNS of channel flow (as discussed in the Introduction) shows that studies have used Δx + ∼ 10-13 and Δz + ∼ 4-7. G2 has a somewhat coarser streamwise grid resolution, but the G1 resolution seems appropriate for DNS. The spanwise resolution is somewhat coarser than that required for DNS, and the results improve with the finer streamwise grid. As discussed below, predictions on G1 compare quite well with M-DNS; thus, finer spanwise grids are not investigated, although some differences are observed for the spanwise velocity skewness and flatness predictions, which could be due to the coarseness of the spanwise resolution. The simulations are performed using Δt = 5 × 10⁻⁴ U c /H for 140K iterations, and time averaging is performed over the last 40K iterations. The predictions showed peak error in mass conservation at the Chebyshev polynomial domain interfaces. The peak and average divergence are 5 × 10⁻⁴ and 2.5 × 10⁻⁶ (in units of U c /H), respectively, for Re τ = 590. The solution convergence was considered reasonable given the small divergence levels.
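The grid and time-step resolutions in wall units quoted above follow from the domain size, grid counts and friction velocity; a minimal sketch is given below, with example inputs that only loosely resemble the Re τ = 590 case (the domain lengths are assumptions for illustration).

```python
import numpy as np

def wall_units(lx, lz, nx, nz, dt, re_tau, h=1.0, nu=None):
    """Grid spacing and time step in wall (plus) units for a channel of
    half-height h, with re_tau = u_tau * h / nu defining the friction velocity."""
    if nu is None:
        nu = 1.0 / re_tau            # convention: u_tau = 1 and h = 1
    u_tau = re_tau * nu / h
    dx_plus = (lx / nx) * u_tau / nu
    dz_plus = (lz / nz) * u_tau / nu
    dt_plus = dt * u_tau**2 / nu
    return dx_plus, dz_plus, dt_plus

if __name__ == "__main__":
    # Example values loosely resembling the Re_tau = 590 case; the 2*pi x 2 x pi
    # domain is assumed here for illustration only (see M-DNS for the actual setup).
    print(wall_units(lx=2 * np.pi, lz=np.pi, nx=384, nz=256, dt=5e-4, re_tau=590.0))
```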
Instantaneous velocity contours in Figure 5(a,b) show turbulent flow behavior. For the wall-normal velocity, turbulence is more prevalent in the near-wall region and shows ejection events. The streamwise velocity fluctuations in the sublayer in Figure 5(c) show longitudinal streaks of alternating positive and negative values. An analysis of the vortical structures shows counter-rotating elongated quasi-streamwise structures emerging in between the streaks. The counter-rotating vortices induce lifting on each other, causing them to eject from the surface, and they merge to form Λ-type structures. The generation of the Λ-type structures is consistent with the varicose instability mechanism reported by Schlatter et al. (2008) for bypass transition flows. A comparison of the streaks between the Re τ = 180 and 590 DNS and the Re τ = 2003 LES (Bhushan & Walters, 2014) shows the following trends in the characteristics of the hairpin structures. With increasing Re, the structures get stronger and are formed deeper in the log-layer, i.e. from y + ∼ 60 for Re τ = 180 to y + ∼ 200 for Re τ = 2003. Their length increases from l + ∼ 150-200 to 300-400. Their planar tilt angle (α) with respect to the streamwise direction increases from α = 3-5° to 14-15°. An analysis of the local shear stress reveals that −u′v′ shows Q4 and Q2 events on the bottom and top walls, respectively; −u′w′ shows equally dominant Q1 and Q2 events; and −v′w′ shows events in all four quadrants (figure not shown). Both −u′w′ and −v′w′ show equally dominant positive and negative events, which cancel each other, resulting in zero stresses upon averaging. The characteristics of the coherent turbulent structures are consistent with those reported by Jeong, Hussain, Schoppa, and Kim (1997) for Re τ = 180.
Validation focuses on the prediction of the ensemble-averaged mean streamwise velocity, second- and higher-order statistics and spectra of the turbulent fluctuations, and the TKE and stress budgets. As shown in Figures 6 and 7, ParSpectra predictions compare within 0.05%, 0.23%, 0.6% and 1.1% of M-DNS for the mean flow, RMS, skewness (S) and kurtosis (F), respectively. The larger error in kurtosis is primarily due to the F(w′) predictions. The averaged errors are 0.3% and 0.6% for Re τ = 180 and 590, respectively. The Re τ = 180 mean velocity profile shows higher velocity in the log-layer than Re τ = 590, which is due to the low-Re effect. For both Re, the peak u′ rms occurs in the buffer layer at y + ∼ 10-15, whereas the peaks of v′ rms , w′ rms and the shear stress −u′v′ occur in the lower log-layer.
The v′ rms /u′ rms ratio increases steadily throughout the log-layer, whereas w′ rms /u′ rms is almost constant. The S(u′) profile shows positive values near the wall and negative values away from the wall, which suggests the presence of high- and low-speed turbulent fluctuations near and away from the wall, respectively. S(v′) is locally negative in the buffer layer due to strong ejection events. S(w′) is almost zero because of the flow symmetry. The flatness profiles for all turbulent velocities show high values near the wall, where F(v′) is the largest and F(u′) is the smallest. The flatness approaches ∼3 away from the wall, suggesting that the turbulence is isotropic in nature.
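The higher-order statistics discussed above follow the standard definitions; a minimal sketch of their evaluation from a fluctuating velocity sample is given below.

```python
import numpy as np

def turbulence_statistics(u):
    """RMS, skewness S and flatness (kurtosis) F of a fluctuating velocity
    sample u' (mean removed). Standard definitions; F approaches 3 for a
    Gaussian (near-isotropic) signal."""
    up = u - np.mean(u)
    rms = np.sqrt(np.mean(up**2))
    skew = np.mean(up**3) / rms**3
    flat = np.mean(up**4) / rms**4
    return rms, skew, flat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(size=100_000)        # synthetic Gaussian signal
    print(turbulence_statistics(sample))     # skewness ~ 0, flatness ~ 3
```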
The energy spectra in the sub- and buffer-layers in Figure 8 (which are only planar averaged) show that most of the energy is contained in u′, followed by w′ and v′. The energy resides mostly in the lower wavenumbers κ ≤ 20 and shows a κ −7 scaling at higher wavenumbers, the latter being consistent with exponential decay in the dissipation range. The lower wavenumbers (κ ≤ 20) show almost equal energy (i.e. all dominant length scales of the fluctuations have similar amplitude), which explains the high flatness in the near-wall region. Away from the wall, all three velocity fluctuations show similar energy content, which confirms the isotropy of the turbulence. The spectra show an inertial subrange with a κ −5/3 scaling in the log-layer, and the subrange extent grows with increasing wall distance. The inertial subrange is followed by a κ −9/2 range, slightly shallower than the dissipation-range spectra.
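A minimal sketch of how a one-dimensional streamwise energy spectrum of the type shown in Figure 8 can be computed (FFT along x followed by spanwise plane averaging) is given below; it is illustrative and not the solver's actual post-processing routine.

```python
import numpy as np

def streamwise_spectrum(u_plane, lx):
    """One-dimensional streamwise energy spectrum E_uu(kappa) of the
    fluctuation field u'(x, z) at a fixed wall distance, averaged over the
    spanwise direction. u_plane has shape (nx, nz)."""
    nx = u_plane.shape[0]
    up = u_plane - u_plane.mean(axis=0, keepdims=True)   # remove the mean per z-line
    u_hat = np.fft.rfft(up, axis=0) / nx
    e_uu = np.mean(np.abs(u_hat)**2, axis=1)              # spanwise (plane) average
    kappa = 2.0 * np.pi * np.arange(e_uu.size) / lx
    return kappa, e_uu

if __name__ == "__main__":
    nx, nz, lx = 256, 128, 2.0 * np.pi
    x = np.linspace(0.0, lx, nx, endpoint=False)[:, None]
    u = np.sin(4.0 * x) + 0.1 * np.sin(20.0 * x) + np.zeros((nx, nz))
    kappa, e_uu = streamwise_spectrum(u, lx)
    print(kappa[np.argsort(e_uu)[-2:]])                   # peaks at kappa = 20 and 4
```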
The TKE budget in Figure 9(a) shows that viscous diffusion is the source term and viscous dissipation is the sink term in the sub- and lower buffer-layers. Production and turbulent transport start to develop in the buffer layer. For both Re, the peak production is observed at y + ∼ 14, and the peak dissipation occurs at the wall. As shown in Figure 9(b), the most dominant dissipation term is ε 2 = (1/Re)(∂u′ 1 /∂x 2 )², i.e. the contribution due to the wall-normal gradient of u′. The other dominant term is ε 3 = (1/Re)(∂u′ 1 /∂x 3 )², the contribution due to the spanwise gradient of u′; the former peaks near the wall and the latter peaks in the buffer layer. P/ε ∼ 1.8 and ∼1 in the buffer- and log-layers, respectively. In the log-layer, production and dissipation are the dominant terms, whereas pressure transport is mostly small and pressure-strain is zero. The stress budgets in Figure 9(c-f) show that v′v′ generation near the wall (y + < 7) is due to pressure-diffusion and is balanced by pressure-strain. In the buffer layer, v′v′ interacts with u′v′ to produce u′u′. The turbulent energy is then redistributed from u′u′ to w′w′ and v′v′ via pressure-strain in the lower log-layer. Energy decay is mostly from u′u′, followed by w′w′, and least from v′v′. The energy transfer mechanism is consistent with that reported by Voke and Yang (1995) for bypass transition flows.
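The shear-production term and the P/ε ratio quoted above follow the standard TKE budget definitions; a minimal sketch with placeholder profiles is given below.

```python
import numpy as np

def production_to_dissipation(uv, dUdy, eps):
    """Shear production P = -<u'v'> dU/dy of the TKE budget and the ratio
    P/eps evaluated along a wall-normal profile (standard definitions)."""
    prod = -uv * dUdy
    return prod, prod / eps

if __name__ == "__main__":
    # Placeholder wall-normal profiles (illustrative, not the study's data).
    uv = -np.array([0.2, 0.6, 0.8, 0.7, 0.5])        # <u'v'>
    dUdy = np.array([12.0, 3.0, 1.5, 0.9, 0.6])      # mean shear dU/dy
    eps = np.array([1.5, 1.0, 1.2, 0.65, 0.32])      # dissipation rate
    prod, ratio = production_to_dissipation(uv, dUdy, eps)
    print(ratio)    # ~1.8 at the buffer-layer point, ~1 in the log-layer
```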

Temporal DNS of bypass transition flow over flat-plate boundary layer
Boundary layer flow involves several regimes, as discussed by Morkovin (1988). The regimes that cannot sustain self-energizing turbulent motions are referred to as 'laminar', i.e. if the flow in this regime is locally exposed to turbulence, the instabilities decay rapidly. 'Buffeted laminar' flow is obtained when the laminar boundary layer at low Reynolds numbers (say, below Re min ) is exposed to turbulence bursts with length scales smaller than the boundary layer thickness δ. In this regime, the boundary layer experiences some three-dimensional motions with scales larger than δ, and the wall stress and heat transfer characteristics start to deviate from those of the laminar boundary layer; however, the flow remains intrinsically laminar. When the laminar boundary layer above a critical Reynolds number Re cr is exposed to small disturbances, the boundary layer undergoes 'natural transition' from the laminar to the turbulent state. Under these conditions, amplification of selected disturbance modes results in secondary instabilities, and eventually transition occurs around Re tr , where Re tr > Re cr > Re min . The critical and transition Reynolds numbers for natural transition are well defined using the linearized Navier-Stokes equations. In cases where the laminar boundary layer is exposed to large external disturbances (such as high free-stream turbulence intensity, local wall distortions/roughness, pressure gradients, etc.), Re tr does not appear to be related to Re cr , and transition may even occur upstream of Re cr . Under such conditions, the transition to turbulence bypasses the well-developed linearized instability theory; this type of flow is referred to as 'bypass transition flow'. Schlatter et al. (2008) reported that bypass transition occurs primarily due to the generation of varicose-like secondary instabilities. A fully developed 'turbulent' regime is obtained when the unsteady energy dissipated near the wall is restored through intermittent turbulent bursts, the latter manifesting as production of negative Reynolds stress −u′v′ at the expense of energy from the mean flow.
Bypass transition due to free-stream turbulence has been well studied in the literature, and studies have primarily used the spatial approach, i.e. the entire plate is gridded and simulations are performed using inlet and outlet boundary conditions (Jacobs & Durbin, 2001; Wu et al., 2017; Wu & Moin, 2009; Zaki & Durbin, 2005). An alternative to the spatial approach is the 'temporally developing' approach, wherein a simulation is started from an initial condition and a periodic boundary condition is used along the streamwise direction. This allows the simulation domain to move along with the flow, as illustrated in Figure 10(a). The solution at any instant is expected to be a realization of an infinitesimal section of the flat-plate boundary layer. Thus, temporal simulations require a smaller streamwise domain than their spatially developing counterparts and are expected to be computationally inexpensive. Further, they allow a straightforward application of numerically accurate pseudo-spectral solvers, especially FFT in the streamwise direction. The temporal approach has been applied for DNS/LES of flows involving either quasi-steady-state or slowly developing turbulent regimes, for example, plane channel, mixing layer and jet flows (Akhavan, Ansari, Kang, & Mangiavacchi, 2000; Bhushan, Warsi, & Walters, 2006; Moser et al., 1999), and turbulent flat-plate boundary layer flows (Bobke et al., 2016; Kozul et al., 2016; Zhang et al., 2018). He and Seddighi (2013) simulated the growth of the turbulent boundary layer in a plane channel from a low Re τ = 180 to a high Re τ = 420. The results showed that the flow undergoes bypass transition similar to that expected for a flat-plate boundary layer.
Herein, temporal DNS is applied to bypass transition flow. The simulations are started with an initial flow condition obtained by superimposing the Blasius mean streamwise velocity (U 0 ) at Re θ = 106 with turbulent fluctuations. To obtain the initial turbulent fluctuations, an artificial isotropic turbulence field was first generated in a cubic domain with uniform grid resolution using the Rogallo (1981) approach with a von Karman energy spectrum E(κ), where κ is the wavenumber, the integral length scale is 1.85δ 0 , and Tu 0 = 5%. The turbulent velocities were interpolated onto the simulation grid and damped near the wall. Next, a simulation was performed using just the turbulent fluctuations with the flat-plate boundary conditions. The simulation was continued until the velocity skewness, kurtosis, and velocity-derivative skewness converged to 0, 3, and −0.29 (the values expected for isotropic turbulence; Pope, 2000), respectively, in the regions away from the wall. The amplitudes of the fully developed turbulence were then scaled to the required Tu in for the simulations. The initial turbulence and the wall-normal Tu profile are shown in Figure 10(b,c). The initial mean flow profile, integral length scale, and Tu in = 3.5% were selected to match the inflow conditions used in JD-SDNS and in the experiment (Roach & Brierly, 1990). Simulations were performed using five different domain sizes: (20δ 0 )³, (25δ 0 )³, (30δ 0 )³, (40δ 0 )³, and (50δ 0 )³, where δ 0 is the initial laminar boundary layer thickness. The simulations for the first three domains were performed using a 192 × 193 × 192 grid. The simulation on the (50δ 0 )³ domain was performed using a finer grid with 256 × 257 × 256 points, and the simulations on the (40δ 0 )³ domain were performed using both grid resolutions. Note that JD-SDNS used a domain size of 620δ 0 × 40δ 0 × 30δ 0 along the streamwise, wall-normal and spanwise directions, respectively, which was discretized using 2048 × 180 × 192 cells.

Figure 11. Temporal DNS predictions of: (a) C f ; (b) displacement-thickness-based Reynolds number Re δ* ; and (c) shape factor H in boundary layer coordinates obtained on different domain sizes for Tu in = 3.5%. Results are compared with JD-SDNS, experimental data (Roach & Brierly, 1990), and analytic profiles.
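Referring back to the initial-turbulence generation described above, a minimal sketch of a von Karman-type model spectrum and of rescaling the fluctuations to a target Tu in is given below; the spectrum constants are assumptions, since the exact expression used with the Rogallo (1981) procedure is not reproduced in the text.

```python
import numpy as np

def von_karman_spectrum(kappa, u_rms, length_scale):
    """Von Karman-type model energy spectrum E(kappa). The shape
    E ~ (kappa/k_e)^4 / (1 + (kappa/k_e)^2)^(17/6) is a commonly used form
    and is an assumption here, not the exact expression used in the study."""
    k_e = 1.0 / length_scale
    khat = kappa / k_e
    return (u_rms**2 / k_e) * khat**4 / (1.0 + khat**2)**(17.0 / 6.0)

def scale_to_tu(u, v, w, u_ref, tu_target):
    """Rescale fluctuation fields so that Tu = sqrt((u'^2+v'^2+w'^2)/3)/U_ref
    matches the target intensity (e.g. Tu_in = 3.5%)."""
    tu = np.sqrt((np.mean(u**2) + np.mean(v**2) + np.mean(w**2)) / 3.0) / u_ref
    factor = tu_target / tu
    return u * factor, v * factor, w * factor

if __name__ == "__main__":
    kappa = np.linspace(0.1, 50.0, 6)
    print(von_karman_spectrum(kappa, u_rms=0.05, length_scale=1.85))
    rng = np.random.default_rng(1)
    u, v, w = (rng.normal(scale=0.05, size=10_000) for _ in range(3))
    u, v, w = scale_to_tu(u, v, w, u_ref=1.0, tu_target=0.035)
    print(np.sqrt((np.mean(u**2) + np.mean(v**2) + np.mean(w**2)) / 3.0))   # ~0.035
```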
Figure 10(d) shows side-by-side contours of the instantaneous streamwise velocity at the spanwise center plane z = 0 every 2000 iterations, or Δt = 1U 0 /L. The contours provide a visual impression of the boundary layer growth. As evident, the boundary layer grows steadily in the laminar regime. The growth is rapid in the transition regime, wherein the near-wall region shows acceleration of the flow due to the generation of turbulent fluctuations. The near-wall low-velocity region in the turbulent regime is thinner than that in the laminar regime, as expected since the former involves a sharper near-wall gradient than the latter. The growth of the boundary layer agrees qualitatively well with that expected for a flat plate.
The growth/decay of the integral boundary layer parameters, namely the skin friction coefficient C f , displacement thickness δ * , momentum thickness θ, and shape parameter H = δ * /θ, is presented in Figure 11 in the boundary layer coordinate Re θ . The C f profile shows an initial decrease up to Re θ = 300 (laminar region), then a gradual increase up to Re θ = 650 (transition region), and an unsteady behavior with gradual decay for larger Re θ (turbulent region). The predictions compare quite well with the JD-SDNS results and the experimental data. Note that the temporal results show unsteadiness in the turbulent region, whereas the spatial DNS is steady; this is because the former shows instantaneous (plane-averaged) values, whereas the latter shows ensemble-averaged values. The results on different domain sizes show a very similar pattern in the transition and turbulent regions, and the small variations could be attributed to changes in grid resolution and free-stream turbulence length scale. However, they show significant differences in the laminar region. The smaller domains somewhat over-predict C f , whereas the larger domains compare better; this could be because the smaller domains are not large enough to maintain the periodicity of the flow. The δ * and H predictions also compare well with the validation datasets, and the best predictions are obtained using the (30δ 0 )³ and (40δ 0 )³ domain sizes.
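The integral parameters reported in Figure 11 follow the standard definitions; a minimal sketch of their evaluation from a mean streamwise velocity profile is given below, using an illustrative profile rather than the simulation data.

```python
import numpy as np

def trapz(f, y):
    """Trapezoidal integration of f(y) over the wall-normal coordinate y."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(y)))

def integral_parameters(y, u_mean, u_inf, nu):
    """Displacement thickness, momentum thickness, shape factor H and skin
    friction coefficient C_f from a mean streamwise velocity profile u(y)."""
    ratio = u_mean / u_inf
    delta_star = trapz(1.0 - ratio, y)
    theta = trapz(ratio * (1.0 - ratio), y)
    shape = delta_star / theta
    dudy_wall = (u_mean[1] - u_mean[0]) / (y[1] - y[0])   # one-sided wall gradient
    cf = 2.0 * nu * dudy_wall / u_inf**2
    return delta_star, theta, shape, cf

if __name__ == "__main__":
    # Illustrative laminar-like profile (not the study's data).
    y = np.linspace(0.0, 10.0, 400)
    u = np.tanh(0.5 * y)
    print(integral_parameters(y, u, u_inf=1.0, nu=1e-4))
```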

Conclusions and future work
Convergence and scalability of a pseudo-spectral solver with hybrid MPI/OpenMP parallelization are studied, and the solver is validated for temporal DNS of turbulent channel and flat-plate bypass transition flows. The solver shows second-order temporal accuracy, and spatial orders of accuracy of 6.8 and 3.6 for 2D laminar and turbulent flow simulations, respectively. The solver displays reasonable scalability up to 8K processors on grids with 800M grid points, and it is estimated that the best computational performance is achieved for simulations using 100-400K grid points per processor on systems with 2GB memory per processor. The mean and turbulent channel flow predictions, including higher-order statistics and the TKE and stress budgets, compare within 0.5% of the available benchmark results. Preliminary temporal DNS predictions of flat-plate bypass transition flow are encouraging and validate its viability as an inexpensive alternative to the spatial approach, as it requires an order of magnitude smaller streamwise domain (and grids). However, further investigation is required for a direct comparison between the temporal and spatial DNS in plate coordinates, i.e. in terms of Re x , which will require estimation of the domain translation velocity. In addition, temporal simulations do not allow ensemble averaging in the same way as spatial simulations, which needs to be investigated. The most efficient approach seems to be a combination of averaging over homogeneous directions and time-window averaging (as proposed by Kozul et al., 2016), along with multiple simulations with different initial conditions. The use of multiple simulations, however, would increase the computational expense and erode the cost advantage of the approach.