MAST-RT0 solution of the incompressible Navier–Stokes equations in 3D complex domains

A new numerical methodology to solve the 3D Navier-Stokes equations for incompressible fluids within complex boundaries and unstructured body-fitted tetrahedral mesh is presented and validated with three literature and one real-case tests. We apply a fractional time step procedure where a predictor and a corrector problem are sequentially solved. The predictor step is solved applying the MAST (Marching in Space and Time) procedure, which explicitly handles the non-linear terms in the momentum equations, allowing numerical stability for Courant number greater than one. Correction steps are solved by a Mixed Hybrid Finite Elements discretization that assumes positive distances among tetrahedrons circumcentres. In 3D problems, non-Delaunay meshes are provided by most of the mesh generators. To maintain good matrix properties for non-Delaunay meshes, a continuity equation is integrated over each tetrahedron, but the momentum equations are integrated over clusters of tetrahedrons, such that each external face shared by two clusters belongs to two tetrahedrons whose circumcentres have positive distance. A numerical procedure is proposed to compute the velocities inside clusters with more than one tetrahedron. Model preserves mass balance at the machine error and there is no need to compute pressure at each time iteration, but only at target simulation times.


Introduction
The Navier-Stokes Equations (NSEs) govern external and internal flows in many real-life industrial, environmental and biological problems (e.g. aircraft and ship problems, rotomachinery blade applications, hemodynamic and biological flows). The challenging research topics involved in the numerical solution of such mathematical problems mainly concern the choice of the velocity-pressure coupling algorithm and the design of the computational mesh discretizing the physical domain.
In compressible flows, pressure and density are linked by the state algebraic equation. In contrast, in incompressible flows, density is constant and pressure has to be solved along with velocities from all the momentum and continuity equations. With this motivation, the numerical algorithms used to solve the NSEs can generally be divided into density-based and pressure-based solvers (Shyy & Mittal, 1998;Tao, 2001). Density-based solvers are commonly applied to high-Ma compressible flows, while pressure-based solvers were originally proposed for incompressible flows, and then successfully extended to compressible flows (Tao, 2001).
CONTACT Costanza Aricò costanza.arico@unipa.it Two approaches are generally used in pressure-based solvers, namely the direct (or coupled) approach and the segregated approach (e.g. Mazhar, 2016 and cited references). In the first case, the whole set of momentum and continuity equations is solved simultaneously, resulting in a strong coupling between pressure and velocity. The main drawback is the large amount of required computational effort and computer memory, which makes this approach unsuitable to facing many practical engineering applications (Darwish et al., 2015). In the segregated approach, pressure and velocity are solved separately and sequentially, using previously computed values of the other dependent variable. The core of the problem is how to update the pressure field so that the divergence-free velocity field condition is satisfied.
Several numerical procedures can be distinguished in the segregated approach, e.g. the projection method (Chorin, 1968;Kim & Moin, 1985), the penalty method (Braaten & Shyy, 1986;Hughes et al., 1979), the artificial compressibility method (Harlow & Welch, 1965;Malan et al., 2002;Vrahliotis et al., 2012), and the pressure-correction method (Ozoe & Tao, 2001;Patankar, 1980). The fractional step projection method (Chorin, 1968;Kim & Moin, 1985) and the SIMPLE (Semi-Implicit Method for Pressure-Linked Equations) pressurecorrection method (Patankar, 1980(Patankar, , 1981 have become very popular. In both methodologies, a pressure correction is introduced to improve the velocities computed from the solution of the momentum equations, and to satisfy the continuity equation. Projection methods, commonly regarded as fractional-step methods, can be classified into pressure-correction methods, velocitycorrection methods and consistent splitting methods (Guermond et al., 2006). A sequence of decoupled elliptic equations for velocity and pressure has to be solved at each time step, and this represents one of the most attractive features of such procedures, especially for simulations of large-scale problems (Guermond et al., 2006). In the projection method, the pressure correction Poisson equation is solved once per time step, while for the SIM-PLE method, the momentum and pressure correction equations are solved several times in each time step. For these reasons, projection methods could handle unsteady flow problems more easily than SIMPLE.
The SIMPLE-family algorithms (e.g. SIMPLER, SIM-PLEC, SIMPLEX) are extensions of and improvements to the SIMPLE algorithm. Other pressure correction methods can be regarded as further extensions of the SIMPLE-family models, for example PISO and CLEAR (e.g. Aguerre et al., 2020 and cited works). Segregated solvers like the SIMPLE-family algorithms have shown poor convergence, especially when used for swirling flow fields (Hanby & Silvester, 1996). In these problems, where the coupling between radial and tangential momentum equations is strong, the linearization of the momentum equations leads to a sequential solution of these equations, without accounting for the coupling between the momentum equations and the velocity components (Hanby & Silvester, 1996). Even though several 'ad hoc' procedures have been presented in order to overcome the above-mentioned convergence problems (e.g. Gosman et al., 1976), the need of parameter calibration for convergence of the iterative procedure requires a high computational effort, which limits the application of these algorithms. These reasons, along with the increase of computer performance and memory, have motivated several authors towards new coupled approaches (e.g. Darwish et al., 2009 and cited references).
Other options have been proposed in the literature, e.g. the vorticity-stream function methods, where pressure is eliminated from the governing equations and velocity and pressure are replaced by the vorticity and the stream function (e.g. Calhoun, 2002 and cited references). Two main reasons have limited their use: the difficulty of handling wall boundary conditions and the difficulties arising in the extension from 2D to 3D problems (Calhoun, 2002).
It is widely recognized that spatial discretization of the incompressible flow equations on collocated grids leads to unphysical odd-even coupling of the pressure (i.e. the so-called spurious checkerboard modes) (e.g. Dalal et al., 2008;Perron et al., 2004). Staggered grids have been one of the most common ways to overcome these problems, where different grid points for velocity and pressure are used (e.g. Harlow & Welch, 1965;Perot, 2000).
In the last decade, the use of unstructured grids has become popular due to their capacity to discretize real arbitrary domains and to easily get local refinement. Geometric complexity is a drawback for a straightforward extension of the staggered mesh to unstructured grids and most solvers require the addition of terms which could generate unphysical solutions and loss of mass conservations (e.g. Perot, 2000). Moreover, the time step limitation required by the Courant condition can become very severe, due to the existence of few computational elements much smaller than the average ones.
In the past few decades, Finite Volume Methods (FVMs) (e.g. Kim & Choi, 2000;Mathur & Murthy, 1997;Perron et al., 2004;Plana Fattori et al., 2013;Vidovic et al., 2004) and Finite Element Methods (FEMs) (e.g. Bazilevs et al., 2013;Fortin, 1981;Pai et al., 2013;Zienkiewicz et al., 2013) have been preferred to other methods (e.g. finite difference methods) in handling irregular boundaries, because they can use unstructured triangular or tetrahedral meshes, which can easily discretize arbitrary geometries. A large multitude of FVMs has been proposed in the last few decades, with different choices of control volumes, both for collocated and staggered grids. Many of these proposed techniques suffer from a 'non-orthogonality condition', as pointed out in (Gao et al., 2012), in discretization of the second-order partial derivative terms in the momentum equations and in the pressure-correction Poisson equation. Hybrid FV/FEMs (e.g. Busto et al., 2018;Gao et al., 2012) take advantage of both FVMs and FEMs, discretizing the momentum equations with the FVMs and the Poisson equation with the FEMs. The Discontinuous Galerkin Finite Element Methods (DG-FEMs) (e.g. Bassi & Rebay, 1997;Lehrenfeld & Schöberl, 2016 and cited references) combine the performing features of the FEMs and FVMs. Like the classical FEMs, the DG-FEMs achieve high-order accuracy using high-order polynomial approximation within an element rather than using wide stencils, as in the case of finite volume methods. The physics of wave propagation is however correctly accounted for by the solution of a Riemann problem (e.g. Toro et al., 2020 and cited references) arising from the discontinuous solution at element interfaces, which makes them similar to FVMs (e.g. Toro & Vazquez-Cendon, 2012). The main drawback of DG-FEMs, compared to the classical FEMs and FVMs, is that they require solutions of systems of equations with more unknowns for the same grids, and have been recognized to be very demanding in terms of both computational costs and storage requirements.
Recently, mesh-free methods have received increasing attention from researchers, due to their ability to solve flow problems with complex geometries or involving multi-phases fluids. The Smoothed Particle Hydrodynamics (SPH) method is a pure Lagrangian, mesh-free procedure, originally proposed for the solution of astrophysical problems (Gingold & Monaghan, 1977;Lucy, 1977). The fluid domain is discretized by a set of moving particles, and the governing equations are solved for each of them. Each particle has a support domain with a characteristic spatial distance over which their physical properties (namely pressure and velocity) are 'smoothed' by means of a kernel function, generally with a Gaussian-type shape. Thanks to its quite easy implementation, this procedure has largely been applied for the solution of compressible and incompressible NSEs (e.g. Oger et al., 2007 and cited references). One of the main drawbacks of the Lagrangian schemes is that the relations among particles have to be updated at each time step, and the topology setting of the matrix system of the Poisson pressure-correction equations has to be performed at each time iteration, as well as its factorization operations.
During the last years, Virtual Elements Methods (VEMs) (e.g. Beirão da Veiga et al., 2018, and cited references) have been proposed as a new FEM paradigm to solve Partial Differential Equations. VEMs construct a conforming Galerkin FE scheme, dealing with general polygonal/polyhedral mesh elements, also with nonconvex shape. VEMs have been recently applied for the solution of the NSEs in 2D problems (e.g. Beirão da Veiga et al., 2018, and cited references), but their extension to 3D problems is still limited to very simple domain geometries and boundary shapes (Liu & Chen, 2019).
In the present paper, we propose a new finite volume solver for the solution of 3D incompressible NSEs over unstructured tetrahedral meshes. We apply a fractional time step procedure where a predictor and a corrector problem are sequentially solved. The procedure presents substantial differences compared to the fractional step procedures presented in the literature, based upon the common projection procedures (e.g. Guermond et al., 2006;Perron et al., 2004). The predictor step (PS) is carried out by applying the Eulerian Finite Volume MAST (Marching in Space and Time) numerical procedure, recently proposed for the solution of shallow waters and groundwater problems (Aricò et al., 2007;Aricò et al., 2011Aricò et al., , 2012Aricò et al., , 2013aAricò et al., , 2013bAricò & Tucciarelli, 2007a, 2007b. In this step, all the terms are retained in the momentum equations. The major advantages of MAST are the following: (1) it explicitly handles the nonlinear momentum terms in the momentum equations, by means of the sequential solution of a three-variable Ordinary Differential Equations (ODEs) system for each computational cell, with a computational effort which is simply proportional to the number of tetrahedral elements, (2) it provides numerical stability with respect to large Courant-Friedrichs-Lewy (CFL) numbers, that can be much greater than one, at a cost of local accuracy reduction (Aricò et al., 2007;Aricò & Tucciarelli, 2007a, 2007b. The correction step is split into two parts, named CS1 step and CS2 step. In CS1 step, three linear systems are solved for the three velocity components (one system for each velocity component), which update the viscous terms in the momentum equations. In the second corrector step (CS2), a single linear system is solved for the pressure correction unknown. The matrices of the systems of CS1 and CS2 steps are well conditioned, as they are sparse, symmetric and diagonally dominant, and lead to a very fast solution of the associated systems by the use of a preconditioned conjugated gradient solver. The matrix coefficients are constant in time, computed and factorized only once at the beginning of the numerical simulation. This makes it possible to save a lot of computational time, compared to other numerical schemes (e.g. Lagrangian schemes).
Both CS1 and CS2 steps are solved assuming a mass lumping Mixed Hybrid Finite Element (MHFE) (e.g. Auricchio et al., 2017, and cited references) discretization inside each tetrahedron, similar to the one proposed by Younes et al. (2006). The mass lumping option has been chosen because it is easy to be used together with tetrahedral elements. To maintain good convergence and accuracy properties, our MHFE scheme assumes the distance among circumcenters to be positive, a condition which is always satisfied in the Delaunay meshes. Unfortunately, in 3D problems either bad-quality Delaunay or non-Delaunay meshes are provided by most of the available mesh generators (Li & Teng, 2001). To cope with this problem, and to use non-Delaunay meshes still saving the previously mentioned good matrix properties, in the present procedure, a continuity equation is integrated over each single tetrahedron, but the momentum equations are integrated over clusters of tetrahedrons, such that each external face shared by two different clusters is part of two tetrahedrons whose circumcenters have positive distance. To define the velocities inside each cluster with more than one tetrahedron correctly, the minimum energy condition is finally enforced for velocities inside the clusters.
In the proposed numerical solver, mass balance is always preserved at the error machine precision, and there is no need to update the pressure values at each time iteration, because only the pressure gradient appear in the NSEs. We compute the spatial pressure distribution only at target simulation times.
The paper is organized as follows. In section 2.1, we present the governing equations and the projectioncorrection formulation of the problem and in section 2.2 the spatial discretization. In sections 3.1, 3.2, and 3.3, we present the numerical details of the solution of respectively the PS, CS1 and CS2 steps as defined in section 2.1. In section 4, we present the extension of the previous algorithms to the real case of 3D non-Delaunay meshes and in section 5, we show the model application to three well-known literature tests and one real-life case, as well as an analysis of the associated computational costs.

Governing equations and fractional time step discretization
We solve the 3D Navier-Stokes Equations (NSEs) for a real and incompressible fluid, where Equations (1) are the momentum conservation equations, Equation (2) is the mass conservation equation, t is time, ν kinematic viscosity, u the velocity vector whose components are u, v, and w, along the x, y and z directions respectively, and ψ is the kinematic pressure p/ρ, where p is the fluid pressure and ρ is the fluid density, constant in space and time for an incompressible fluid. The governing equations are solved for the u and ψ unknowns. The problem is well-posed if we correctly assign the initial and boundary conditions (ICs and BCs, respectively). With respect to the BCs, we assign either (a) all the velocity components (essential BCs) or (b) the stress vector (natural BCs) or (c) a combination of the previous ones called free-slip BC. Let be the computational domain and its boundary surface, and let us call u , σ and m the three non-overlapping portions, where BCs (a), (b) and (c) respectively apply, such that = u + σ + m . We formulate the BCs as where x is the coordinate position vector, g is the velocity vector assigned at the boundary, n is the unit outward vector normal to the boundary, and t 1 and t 2 are two other orthogonal unit vectors such that n, t 1 and t 2 give the reference frame attached to the tetrahedron face, τ is the stress tangent vector component in the t 1 -t 2 plane, u n , u τ 1 and u τ 2 are the corresponding u components. In the following sections, for simplicity's sake we assume only hydrostatic stress occurring along σ , which is equivalent to assuming the stress normal to the boundary plane and to neglecting all the viscous terms in Equation (3b). The assigned ICs on the system, in the¯ (¯ = ∩ ) domain are u = u 0 with ∇ · u 0 = 0 and = 0 at t = 0 (4) As mentioned in section 1, we apply a predictor-corrector projection procedure, sequentially solving one predictor and two corrector problems. The predictor and the first corrector steps deal with the momentum equations, while in the second corrector step we combine the mass and momentum conservation equations to enforce the divergence-free condition.
In the next sections, time levels t k , t k+1/3 , t k+2/3 , t k+1 represent the beginning of the generic time iteration, the end of the prediction step, as well as the end of the first and second corrector steps, respectively, and t k+1 also marks the end of the time iteration. Superscripts k, k+1/3, k+2/3 and k+1 mark the values of the variables (i.e. u and ψ) at the corresponding time levels.
In vector-matrix form, we write the momentum Equations (1) as In the framework of a fractional time step procedure, we set and we split Equation (5a) into where E k−1/3 marks the matrix E computed at the end of the first correction system of the previous time iteration and B k marks the pressure gradient term at the beginning of the new time step. Functional analysis easily shows that, due to the stationarity of the pressure gradient term, Equation (7a) form a fully convective system, with only one characteristic line passing through the (x, t) point, while the systems in Equation (7b) and (7c) are fully parabolic (Aricò et al., 2007;Aricò & Tucciarelli, 2009). Integrating in time, from Equation (7) we get where system (8a) is the prediction step (PS), (8b) and (8c) are the first and second correction systems (CS1 and CS2), respectively. Summing systems (8), the integral of the original system (1) is formally obtained. Further details on time discretization will be given below in section 3.

RT0 spatial tetrahedral discretization of pressure and velocity
We discretize the computational domain by means of N T non-overlapping tetrahedrons (named also elements) and assume the velocity field in each tetrahedron e, u e (x) ∈ X e , where X e is the lowest-order Raviart-Thomas (RT0) space (Raviart & Thomas, 1977), such that where ω e j are the space basis functions of X e , W e is the volume of tetrahedron e, x e j is the coordinate vector of the jth node of e, and Q e j is the volumetric flux crossing the face of e opposite to the jth node. Important properties of the space X e are that ∇ · u e is constant over e, u e · n j is constant over each face j of e, and n j represents the unit outward vector orthogonal to face j of tetrahedron e (Raviart & Thomas, 1977). As a result of these properties, each velocity component of the RT0 discretization is piece-wise linear inside each element, and a constant velocity can occur only if the sum of the four fluxes is equal to zero. In this case the divergence is zero inside each element and mass continuity is both locally and globally conserved if the fluxes of two neighboring elements are always opposite one another in the common face.
In PS and CS1, the pressure gradient B is kept constant. In CS2, its correction is computed as the solution of a conservative problem, where the velocity is opposite to the pressure gradient. We will show in the following that, at the end of each step PS, CS1 and CS2, the computed velocity is piece-wise constant in each element, but the fluxes Q e j in Equation (9) are opposite at the common face of two neighboring elements only at the end of CS2. Assuming a piece-wise linear pressure as initial condition, because the CS2 velocity correction is conservative with respect to the pressure correction, the pressure gradient remains piece-wise constant in all the steps. It can also be shown that the pressure correction computed in CS2 step by the MHFE is continuous along the element faces only in their circumcenters. Because pressure changes only in CS2 step, the same condition holds also for the pressure at the end of each time step. See in Figure 1 the pressure contour lines inside the face common to two neighboring tetrahedrons.
The presentation of the solution of each step in Equations (8) is restricted first to the hypothesis of a mesh with an Extended Delaunay Property (EDP), as defined in Aricò et al. (2011), and then generalized in section 4 to the case of totally irregular meshes. In an EDP mesh, the circumsphere of any tetrahedron does not include any other node inside it (Forsyth, 1991;Joe, 1986;Letniowski, 1992) and, with reference to Figure 2, the following conditions hold for two neighboring tetrahedrons e and ep, where vectors v 1 and v 2 are defined as   circumcenter of the face shared by tetrahedrons e and ep. Moreover, the following condition holds for the boundary elements e and the corresponding boundary face v · n < 0 where n is the unit vector normal to the boundary face and oriented in the outward direction, v is defined as and x e cf is the coordinate vector of the circumcenter of the boundary face. In Figure 2(a,b), tetrahedrons e and ep satisfy and do not satisfy the EDP, respectively.
We will prove in the next sections that in an EDP mesh the system matrix, associated with the solution of the steps (8b) and (8c), is an M-matrix (Letniowski, 1992), and this avoids nonphysical local extrema in the computed solution (Forsyth, 1991;Letniowski, 1992). The same matrix property will be saved in the more general case of non-Delaunay meshes by changing the control volumes of the momentum equations, as will be shown in section 4.

Prediction step
The step (8a) is solved by assuming, in each tetrahedral element, constant viscous and pressure gradient forces and by integrating convective inertial terms according to the MArching in Space and Time (MAST) procedure. MAST computes the solution of convective problems, with only one characteristic line passing through each (x, t) point of the domain, in the framework of a fractional time step procedure.
In MAST the three discretized momentum equations are solved in each element, uncoupled from the other ones. To this end, at the beginning of each k th time step, all the elements are marked with an index R k e ∈R k , called 'rank'. The rank of an element e is an integer which is a unit greater than the ranks of all the neighboring ep elements with a common face crossed by a flux entering element e from element ep. All the elements are initialized at the beginning of each new time iteration with rank zero, and vector R k is computed starting from the boundary elements with all interior fluxes oriented outward, which are set with rank 1. The rank is then computed for the elements that are neighbors to elements with rank greater than zero and have no interior faces crossed by fluxes oriented inward. The procedure is continued until all the elements of the computational domain have rank greater than zero. After computation of R k , all the elements are sorted according to increasing rank values and solved one after the other.
During the solution of element e, the integral of the velocity momentum is computed and the mean momentum fluxes are added as external forces to the neighboring elements ep with entering flux and higher rank. The MAST solution of the prediction problem along time step k can be viewed as a reduction of the computational domain carried out after the solution of each boundary element, through extraction of the latter. This reduction makes it possible to solve originally internal elements according to the information carried on by characteristic lines rooted at the boundary of internal elements that behave like domain boundaries. In the example of Figure 3, cell 1 is solved using the information at x 0 carried on by the first characteristic line. After interpolation in time of the solution at the cell boundary in x 1 , the second cell is solved using information carried on by the second characteristic line rooted between t k and t k+1 at x 1 . In this way, the sought after solution is not subject to the Courant restriction on the maximum size of the time step.
In Aricò et al. (2013b) it has been shown that the basic requirement for the application of the MAST algorithm is the existence of a continuous 'anisotropic scalar potential' P of the flow field, such that the velocity can be computed as where K 0 is a (3 × 3) positive-definite tensor. In the same paper it is shown that for any incompressible and viscous fluid this potential always exists and streamlines are always open but, due to numerical discretization, it is possible in the computed velocity field to get one or more loops where a single element without a flux entering from other elements of the same loop does not exist. In this condition the rank vector cannot be computed and, to apply MAST, we need to select a 'cut' face between two elements of the same loop with a small flux, which has to be assumed constant during the time step and equal to its initial value. See in Appendix 1 a very fast procedure, named Order, aimed to compute the R k vector, where the flux entering from the 'cut' face of a loop is treated as a known boundary flux. After application of the Order algorithm, an open source subroutine 'QUICKSORT' in the package KB07 1 can be used to order the elements according to their rank values. After integration in space, the prediction step of Equation (7a) can be written, for any element e, as with the symbols specified as above. The first and second terms on the l.h.s. of Equation (14) represent the local and convective inertial terms, respectively, the third term is the force over the element due to the gradient of the kinematic pressure and the last term accounts for the effect of the viscous forces. From now on we will call j (in the local element reference, j = 1, . . . , 4) the face that tetrahedron e shares with its neighboring ep, and jp (in a local reference too) the face that ep shares with e. The symbols σ e j and n j represent the area and the unit outward orthogonal vector of the jth face of tetrahedron e, respectively.
Due to the cell sorting operation at the beginning of each time iteration, the solution of the momentum Equation (14) for element e with rank R k e , depends (a) on the incoming momentum fluxes from neighboring ep elements with R k ep < R k e , (b) on the kinematic pressure gradients and viscous terms, which are known from the solution of the previous time step, and (c) on the initial solution inside the element at the beginning of the new time step. For these reasons, the PDEs system in Equation (14) can be regarded as a small (3 × 3) Ordinary Differential Equations (ODEs) system to be solved for the velocity components in element e.
Starting from a piecewise constant in space (P 0 ) and divergence-free velocity distribution inside each tetrahedron e, u k e ∈ X e and u k e ∈ P 0,e , we assume that at the generic time τ inside the time step (0 ≤ τ ≤ t) the velocity value is where u f e (τ ) ∈ P 0,e and u f e (0) = 0. Applying the Green lemma to the second integral on the l.h.s. of Equation (14), the equilibrium ODEs system for tetrahedron e can be written as where M e,out j (τ ) is the leaving momentum flux from tetrahedron e to the neighboring tetrahedron ep, M e,in j is the mean incoming momentum flux entering tetrahedron e crossing the jth face, computed along with the solution of the previous elements, ϕ e j = 1 for faces shared by elements with higher rank or boundary faces with positive flux, otherwise ϕ e j = 0. S k ,e and VF k−1/3 e are the sum of the kinematic pressure and the viscous forces over the four faces of tetrahedron e, respectively, computed in the previous time step as specified in sections 3.2 and 3.3. ODEs systems (16) are sequentially solved, one for each element, starting from the tetrahedrons with the smallest R k e value, and proceeding to the tetrahedrons with higher R k e values. We call this step MAST forward step (MAST-fs). To solve system (16) from time 0 to time t, we use an explicit Runge-Kutta (RK) code (Brankin et al., 1993). This code adopts an internal time subgrid, selected within the interval [0t], on which the approximate ODEs solution is computed. The position of the nodes of the grid is automatically selected by the RK code according to a local error estimation (Brankin et al., 1993).
Due to the change of the velocity vector during the ODEs system solution we could obtain, in intermediate times between 0 and t, momentum fluxes moving from the element e into other ep elements with a lower rank (R k ep < R k e ). To avoid this, we approximate the leaving momentum flux in the MAST-fs forward step as with the symbols specified above. To restore the force integral neglected in the integration of Equation (14) due to the limit of the momentum flux assigned in Equation (17), after the end of the forward step we again perform the solution of the 3-ODEs system in sequential way, assuming as initial u e value the one computed at the end of the MAST-fs forward step. In MAST backward solution (MAST-bs) we start from the tetrahedrons with the highest R k e value and proceed to the tetrahedrons with smaller R k e values, saving only the inertial terms of Equation (14), that is: where τ l and w l are the time and the weight associated with the l th Gauss point, with 1 ≤ l ≤n G . The final velocity u k+1/3 is computed by Equation (18b) for τ = t. u k+1/3 is piecewise constant and divergence-free inside each tetrahedron (u k+1/3 e ∈ P 0,e ), but the fluxes crossing the same face of two neighboring elements will not be opposite for the two elements in the computed solution and this disrupts mass conservation at time level t k+1/3 .
In the MAST-fs problem, we compute the boundary momentum fluxes of tetrahedron e, different from zero, as where g j,e is the velocity vector assigned on the face j ∈ u of element e.
In the MAST-bs problem, we compute the same fluxes as

Parallel solution of the MAST prediction step
The solution of the MAST PS in 1D problems is inherently serial and cannot be achieved with parallel computing. On the opposite, in 2D and 3D problems it could be possible to carry out simultaneously the solution of all the elements with the same rank using several CPU physical processors, just saving the average entering momentum fluxes computed for each element in both the forward and backward steps. See in Figure 4(a) the scheme of the MAST-fs solution, for a computer with five processors, of a single time step of a model with 13 elements, ordered in two groups of rank 1 (7 elements) and rank 2 (6 elements). The white circles represent the initial value of the variables and the black circles the final one. T T is the computational time required to each processor for the solution of one element. Observe that, due to the need of solving all the elements with rank 1, before proceeding to the other ones, some processors have to solve 4 elements before moving to the next time step, where in any traditional marching in time method (MTM) only a maximum of 3 elements would be required, as shown in Figure 4(b). An additional time is also required, for each time step, for the rank computation and the element ordering in the MAST-fs (see Figure 4(a)). These last operations, as better specified in the tests run in section 5 with millions of elements, require a computational time two or three order of magnitude smaller than T T . For meshes with at least a few hundred thousand elements, the maximum number N r,k of elements within a single rank R k e at time step k is usually much larger than the available physical processors in standard computers. For test 3 (in section 5.3.3), we estimate N r,k and predict the corresponding computational time of parallelization of the MAST algorithm for different number of processors.

The CS1 correction step
By integrating Equation (8b) in space, we obtain the following system, to be solved for the three unknown velocity components u k+2/3 e ∈ P 0,e , from time level t k+1/3 to time level t k+2/3 , Application of the Green lemma to the volume integral on the l.h.s. of Equation (23)   where ∂u ∂n is the derivative of the velocity vector along the direction orthogonal to face σ e j , d e,ep is the distance of the circumcenters of the two neighboring tetrahedrons e and ep, computed as specified in Equation (25), and the other symbols have already been defined. We compute the distance as where vectors v 1 and v 2 have been defined in Equation (10c). For each tetrahedron e we set such that the r.h.s. of Equation (24) becomes, On the r.h.s. of Equation (23) We set and merging Equation (29) with Equation (27) and solved for the components of the velocity correction u e . The matrix of system (31) (32) and the e-th element of the source term vector is the r.h.s. of Equation (31).
For the boundary face j ∈ u of tetrahedron e, the velocity at the circumcenter of the neighboring tetrahedron is replaced in the system of Equation (31) by the boundary velocity at the circumcenter of the boundary face and the distance d e,ep with the distance d e j between the tetrahedron and the face circumcenter, defined as where v and n have been already defined in Equations (11) and (12). If the boundary face belongs to u and condition g j,e (t k+1/3 ) · n j ≤ 0 holds, the velocity correction u j at the boundary face, from t k+1/3 to t k+2/3 , is equal to zero along with the corresponding off-diagonal matrix coefficient in Equation (32), and the contribution to the source term becomes If the boundary face belongs to u and condition g j,e (t k+1/3 ) · n j > 0 holds, the boundary velocity at time t k+1/3 is assumed to be the tetrahedral element velocity u If the boundary face belongs to σ or to m , for simplicity's sake we neglect the viscous tangent stress components, along with the terms in Equation (31) proportional to the viscosity.
If the EDP is satisfied (see section 2), due to Equations (11)-(12), the diagonal and off-diagonal coefficients in Equation (32) are positive and non-positive, respectively, such that the M-matrix property of the matrix is always guaranteed (Forsyth, 1991;Letniowski, 1992).
We adopt a preconditioned conjugate gradient method to solve system (31), using incomplete Cholesky factorization 2 and a compressed row storage (CRS) format, which makes it possible to save a lot of computational memory allocation (we store the main diagonal and the upper off-diagonal matrix coefficients). Since the matrix coefficients only depend on the geometrical variables, as well as the value of the kinematic viscosity, the matrix is only factorized once, before the time iterations loop starts, and this makes it possible to save a lot of computational effort.
After system (31) is solved, the velocity in each tetrahedron is updated according to Equation (26). At time level t k+2/3 for each tetrahedron e we compute the sum FV k+2/3 e of the viscous forces over its four faces as and we neglect the change in it occurring along the next CS2 correction step. At the end of the CS1 problem, the continuity of the fluxes crossing the same face of two neighboring elements e and ep is not yet restored, and, similarly to u k+1/3 e at the end of the PS problem (section 3.1), u k+2/3 e ∈ P 0,e , but mass conservation is not satisfied.
The spatial discretization of the derivative ∂u ∂n proposed in Equation (24) is similar to the one presented by Younes et al. (2004Younes et al. ( , 2006 for the 2D Mixed Hybrid Finite Element method lumped in the circumcenter of triangles. Unlike the formulation of the present work, Younes et al. (2004) proved that their corresponding 3D discretization exists only for regular tetrahedrons, and cannot be extended to a general 3D tetrahedral discretization.

The CS2 correction step
Substituting Equations (5b) in Equations (8c), we get which has to be solved to restore the flux continuity disrupted in the prediction and in the first correction steps.
To this end we set where η is an unknown function with the same dimensions as ψ, and u k+2/3 RT0 ∈ X e is the RT0 velocity at time level t k+2/3 computed by Equation (9)  is opposite for the two neighboring tetrahedrons e and ep and the weight of each flux is inversely proportional to the volume of the corresponding element. Because the common face can be thought of as the basis of the two tetrahedrons, this is equivalent to assuming an inverse proportionality of the flux with respect to the corresponding height of the tetrahedron. According to Equation (9) Taking the divergence of Equation (41), we get Integration of Equation (42) and application of the Green lemma leads to where the first term is the flux of the vector is defined in Equation (39), Fl k j,e is the flux crossing face j of element e due to velocity u k ∈ X e and Fl e is the total flux of the vector (u k+1 -u k ) crossing the surface of element e. For the gradient of η we adopt the spatial discretization already applied in section 3.2 for the velocity gradient (see Equation (24)).
To compute η, we impose the condition that both u k+1 and u k are divergence-free, with u k+1 ∈ X e and u k+1 ∈ P 0,e , which implies that in Equation (43) the flux Fl e of the velocity (u k+1 -u k ) crossing the total surface of the tetrahedron is zero (see in Figure 5 a 1D sketch of velocity vectors inside tetrahedron e). The resulting system is (44) Equations (44) form a well-conditioned linear system to be solved for the η unknowns. The matrix of the system is sparse, symmetric and positive-definite. Diagonal and off-diagonal coefficients M CS 2 e,e and M CS 2 e,ep are, respectively, and the same M-property of the coefficients of the matrix of the system of CS1 holds (see section 3.2). The e-th coefficient of the source term vector is The solution of system (44)-(45) is performed in the same way as for system (31)-(32). In the present case too, matrix coefficients only depend on the geometrical variables and time step size, so that the system matrix is factorized only once before the time iteration loop starts, saving a lot of computational time.
We call Because the fluxes in the brackets of Equation (47) are continuous, mass conservation is satisfied along all the faces of each tetrahedron and inside each tetrahedron.
The pressure gradient at time level t k+1 is finally computed from Equation (37) as Observe, from Equation (38) and from the definition of u k+2/3 e,RT0 ∈ P 1,e given in Equation (40), that the gradient ∇η and η have respectively a linear and a quadratic variation inside each element, while the pressure gradient and the pressure correction in Equation (48) Equation (49) says that, inside the computational domain, the gradient of the function η is different from the gradient of the kinematic pressure correction k+1 e − k e . Integration of the divergence of both members of Equation (49) would allow us to compute the pressure at the element circumcenters. On the other hand, the kinematic pressure does not need to be known inside the domain in order to proceed to the solution of the next time steps. In section 5.2 we will show how to estimate the kinematic pressure at the computational nodes only at a given number of simulation times. If the velocity is known at the circumcenter of a boundary face belonging to u , we can set Fl k+2/3 j,e equal to the corresponding flux, include it in the r.h.s. of Equation (44) and set the corresponding flux Fl η j,e equal to zero. Observe that the tangent components of the velocity computed by Equation (47) are not the same as those used along the u boundary surface to compute the viscous boundary forces in the previous CS1 correction step. This implies a small difference between the computed boundary face and boundary element velocities.
At the circumcenter of the faces belonging to σ , where the hydrostatic stress boundary condition is set, the following Dirichlet type condition is finally assigned to the corresponding equations After computation of ∇ k+1 e , the sum S k+1 ,e of the kinematic pressure forces over the four faces of tetrahedron e, can be computed applying the Green lemma as ,e is assumed equal to S k ,e in the MAST PS in the next time iteration (see Equation (16) in section 3.1).
Unfortunately, even if in the 2D case it is always possible to get a mesh satisfying the EDP as defined in Equation (11), also for very irregular domain geometries (Aricò et al., 2011;Aricò & Tucciarelli, 2013;Forsyth, 1991, and cited references), in 3D space it is almost impossible to obtain a mesh satisfying the EDP and a 3D Delaunay mesh always has very irregular elements inside it (e.g. slivers, caps, skinny tetrahedrons, . . . ), which could affect the stability of the numerical solution (Joe, 1986). This implies the need to extend the methodology presented in all section 3 to the more general case of non-Delaunay meshes.

The MAST-RT0 pseudo code in the case of Delaunay meshes
(1) Compute model constants, including CS1 and CS2 matrix coefficients by means of Equations (32) (16) and (18)

Tetrahedron clusters
The discretization of the second-order derivative terms in CS1 and CS2 problems (i.e. Equations (31) and (44)) along a face of the computational mesh shared by two tetrahedrons with negative distance, as defined in Equations (25) and (33) (45)), so that the diagonal dominance and the M-property of the matrix system could be lost (Forsyth, 1991;Joe, 1986;Letniowski, 1992). Moreover, unphysical numerical solutions could arise, corresponding to poorly oriented viscous forces and pressure gradients. To avoid this problem, we propose the procedure described in the present section to handle the PS, CS1 and CS2 steps in the case of non-Delaunay meshes.
Let us consider irregular a face shared by two tetrahedrons with a negative distance d e,ep between the corresponding circumcenters, as defined in Equation (25). If one or more irregular faces are present in the mesh, the EDP condition is no longer satisfied (see section 2.2). We group all the tetrahedrons in clusters. A cluster is to be seen as a small non-empty group of neighboring tetrahedrons not sharing any irregular face with other clusters. For example, the two tetrahedrons in Figure 2(b) form a cluster. Each tetrahedron belongs to a single cluster. A single tetrahedron e forms a cluster by itself if it has no irregular faces. In the cluster, we distinguish the external faces, shared by two tetrahedrons of different clusters, and the other internal ones. According to the previous assumptions, all external faces are regular, and in a cluster composed of a single tetrahedron we do not have internal faces.
Let N C be the number of clusters, with N C ≤ N T . The general strategy is to write, instead of the dynamic equilibrium of each single tetrahedron, the dynamic equilibrium of each cluster as a function of a single velocity variation and finally to correct all the tetrahedron velocities inside the clusters, after the CS2 correction step, in order to guarantee flux continuity through all the faces.
From now on, indices m and mp refer to two neighboring clusters, N T ,m is the number of tetrahedrons e belonging to the m-th cluster, N ext f ,m and N int f ,m are the number of external and internal faces of the cluster m, l and r are the local counters of the external and internal faces of the cluster, respectively (l = 1, . . . , N ext f ,m , r = 1, . . . , N int f ,m ), and W m is the volume of the cluster, W m = e=1,N T,m W e (see also Figure 6).

The PS and CS1 problems for non-Delaunay meshes
Solution of the MAST prediction step, as explained in section 3.1, is not affected by the existence of irregular faces. On the other hand, after solution of all tetrahedrons, we need to evaluate, for each cluster, a single velocity variation corresponding to the cluster dynamic equilibrium. Call up k+1/3 e ∈ P 0,e the predicted tetrahedron velocity computed at time t k+1/3 as described in section 3.1 for u k+1/3 e in the case of Delaunay meshes, and ũ m the unknown velocity variation between time levels t k+1/3 and t k to be assigned to cluster m.
For each tetrahedron e of cluster m, we can write the correction velocity up k+1/3 e − u k e ∈ P 0,e by summing the time integrals of the ODEs system (16) and (18) (53) This implies that, summing the Equation (52) where all the momentum fluxes belonging to internal faces as well as the corresponding viscous forces, sum zero and can be deleted. Equation (54) can be seen as the equilibrium equation of the cluster m, which can be approximated by setting In the CS1 step, solution of Equation (31) in not EDP meshes is hindered by the negative distances d e,ep holding between the two circumcenters of tetrahedrons sharing an irregular face. To circumvent the problem, we write the viscous forces equilibrium of the cluster, where the external forces act always on regular external faces. Following the same procedure applied in section 3.2, integrating in space and time Equation (7b) and applying the Green lemma, we get the following system, where ep is the tetrahedron of cluster r, sharing with tetrahedron e the l-th external face of cluster m, with area σ m l , distance d e,ep is defined in Equation (25), and the other symbols have been previously specified.
We set, for the tetrahedrons of cluster m We deal with the BCs of the CS1 problem as described in section 3.2. Observe that the previously described procedure fails to guarantee in the clusters a positive distance from a boundary element circumcenter and the circumcenter of its boundary face. On the other hand, to guarantee a positive distance from the boundary we can simply avoid any internal node with a distance from the circumcenter of each triangular boundary face smaller than the radius of the circle passing through the three nodes of the same boundary triangle. This can be easily done with a methodology that will be shown in section 5.1. After solution of system (59), the velocity in the tetrahedrons e of each cluster m is updated according to Equation (58). At time level t k+2/3 we need to compute for each tetrahedron e the sum FV k+2/3 e of the viscous forces over its four faces, including the irregular ones, needed for the next computational time step. To do that we assume, coherently with the approximation of Equation (58) where the symbols of the momentum fluxes are the same used for the r.h.s of Equation (52). As it happens for the CS1 problem in the EDP meshes (section 3.2), at time level t k+2/3 , the continuity of the fluxes crossing the same face of two neighboring elements e and ep belonging to two different clusters, is not yet restored, u k+2/3 e ∈ P 0,e , and mass conservation is not satisfied.

The CS2 problem for non-Delaunay meshes
In non-Delaunay meshes the solution of the CS2 problem is split into two sub-steps. In the first sub-step we apply the procedure described in section 3.3, including in the mass balance the fluxes crossing the external faces of the cluster instead of the fluxes of the four faces of the single tetrahedron (as in Equations (43) and (44) The solution of the system formed by Equations (62)-(63) guarantees flux continuity on the external faces and global mass conservation inside the cluster, but because the velocities u k+1 e inside each cluster do not generally belong to a single RT0 space, it does not guarantee zero divergence condition inside the cluster, unless the cluster includes only one tetrahedron.
For clusters composed of more than one tetrahedron, we apply the following procedure. We change, in Equation (47) (64) where Fl int j,e is the flux crossing face j of element e, which is internal to the cluster, andφ j is 1 if face j is internal to the cluster, and 0 if it is external. We also assume and we look for the optimal set of internal fluxes that: (1) guarantees the condition u k+1 e ∈ P 0,e for all the elements of the cluster and, as a consequence of constraint (65), mass conservation too, (2) minimizes the kinetic energy inside the cluster.
To get the required Fl int j,e fluxes, in the second substep of the CS2 problem, for each cluster we solve the following linearly constrained quadratic minimization problem, subject to Equation (65) and where functional is the total kinetic energy of the cluster, given by the sum of the kinetic energy of the single tetrahedrons inside it, and the term in the square brackets in Equation (66) is the velocity u k+1 e , according to Equation (64). The last mass conservation Equation (67) in tetrahedron N T,m has been skipped because solving Equation (62) guarantees global mass conservation inside the cluster and this implies that any Equation (67) can be written as a linear combination of all the other ones. A very efficient way to solve problem (66)-(67) is to compute the Lagrangian multipliers, along with the required unknowns, as the solution of the following unconstrained quadratic minimization problem, Observe that Equation (69b) represent the mass conservation equations for all the tetrahedrons of cluster m. Moreover, if N int f ,m = N T,m -1 (see the 2D sketch in Figure 6(a)), system (69b) can be solved independently of Equation (69a) and there is only one set of internal fluxes that satisfy the mass conservation equations.
If N int f ,m ≥ N T,m (see the 2D sketch in Figure 6(b)), we also have to compute Equation (69a) along with Equation (69b), because we need to select, among all the sets of internal fluxes that satisfy mass conservation, the one that minimizes the kinetic energy within the cluster.
Equation system (69) has no special structure and has to be solved with direct solvers, but it is small and only includes a few tetrahedrons of the mesh, so its solution requires a negligible computational burden.
After Equations (64)-(69) are solved, in non-Delaunay meshes we cannot compute the gradient ∇ k+1 e of each tetrahedron by applying Equation (48) with the computed u k+1 e velocity. The reason is that, after the first substep of the CS2 problem, the computed solution satisfies the momentum and continuity equations of the clusters (not of the single tetrahedrons) and only the fluxes crossing the external faces of the clusters are computed under these constraints. The fluxes crossing the internal faces of each cluster are computed, during the second sub-step of the CS2 problem, by satisfying the continuity equations only. For these reasons, we compute ∇ k+1 e by applying the following general procedure.
We look for a new velocity u c m common to all the tetrahedrons of the cluster. Let be a scalar functional, defined as where fl u,l is the flux per unitary area crossing the l-th external face of the cluster, n l is the unit outward vector normal to face l, and all the other symbols have already been defined. In the case of a cluster made up of a single tetrahedron, according to Equation (47)  The relationship of u c m with the pressure gradient ∇ k+1 e is given by Equation (48), where u k+1 e is replaced by u c m in the case of N T,m > 1. Observe that the second step of the CS2 problem does not change the fluxes crossing the external faces and that the pressure gradient ∇ k+1 e is not affected by the corrected fluxes of the internal faces. To compute u c m we minimize , which is equivalent to setting at zero the partial derivatives of the functional with respect to the three components of u c m . The size of the resulting system is (3 × 3), it does not have a special structure, and, as for system (69), is solved using direct solvers.
Once ∇ k+1 e is computed, we obtain the kinematic pressure force on each element of the cluster by setting in all the elements Vector S k+1 ,e is assumed equal to S k ,e in the MAST PS of the next time iteration (see Equation (16) in section 3.1).

The MAST-RT0 pseudo code in the case of non-Delaunay meshes
(1) Compute cluster geometry and model constants, including CS1 and CS2 matrix coefficients by means of Equations (60) and (63a). Perform matrix factorization and set k = 1 (2) Given u k and ∇ k at time t k , apply the MAST prediction step (PS) (i) Compute velocity variation u f e ( t) and u b e ( t) by solving Equations (16) and (18)

Construction of the computational mesh and preliminary model operations
As mentioned in section 3.3, in the 3D space it is almost impossible to obtain a mesh satisfying the EDP given in Equations (10)-(11) (Forsyth, 1991;Joe, 1986 and cited references). Some algorithms exist (e.g. Qhull in Matlab 3 ), which generate a 3D tetrahedral mesh forming a convex hull with arbitrary node location. Unfortunately, the generated mesh has very irregular elements inside it (e.g. slivers, caps, skinny tetrahedrons, . . . ), along with several distances d e,ep (see Equation (25)) strictly equal or close to zero. These irregularities always lead to instability and poor accuracy of the numerical solution (Letniowski, 1992). The computational mesh used by the present solver is created by an off-line procedure, using two open source mesh generators, Netgen (Schöberl, 1997) and Tetgen (Hang, 2015).
We discretize first the computational domain with tetrahedrons using the mesh generator Netgen. The tetrahedral elements of the output mesh are quite regular in shape and size, even if they do not satisfy the EDP. Netgen also allows us to change the size of the tetrahedrons to discretize the internal subdomains properly, with quite smooth transitions of element size. Netgen also allows 'user-friendly' handling of the boundary domain.
Let N NET be the number of nodes of the Netgen mesh. For each boundary triangle bt we compute the corresponding circumsphere bt whose diametral plane contains bt, and we check if any internal node (i.e. not belonging to boundary faces) is internal to bt . At the end of this operation, we remove all the internal nodes inside the circumsphere of the boundary triangles (usually 0.02-0.03% of N NET ). We call the number of the removed nodes N r . The ensemble of the (N NET -N r ) nodes is the input for the mesh generator TETGEN, which regenerates the mesh domain starting from the nodal positions of the (N NET -N r ) nodes, still preserving the input domain boundaries. In order to optimize some aspect-ratio and shape tetrahedrons conditions, Tetgen can also insert additional Steiner nodes (Hang, 2015). The present numerical solver uses as its input the Tetgen output mesh.
Preliminary model operations concern (1) generation of the topology of the tetrahedrons and clusters of tetrahedrons, saved in separate arrays, (2) calculation of the matrix coefficients of systems (59)-(60) and (62), and (3) their factorization before the time iterations loop start.

Computation of the kinematic pressure ψ and of the body forces
Let N be the number of nodes of the computational mesh. We approximate the function ψ according to a Galerkin Finite Element approach, where˜ i is the unknown nodal pressure value and w i is the Galerkin shape function in node i. Once ∇ k+1 e has been obtained inside each tetrahedron e as specified in sections 3.3 and 4.3, we minimize the following scalar functional, where (∇ x q k+1 ) e is the q-th components of ∇ k+1 in tetrahedron e, previously computed.
is a convex function and its minimum is obtained by setting at zero the partial derivatives of Equation (73b) with respect to the nodal˜ k+1 i values, Equation (74a) can be written, according to Equation (72), as Equation (74b) represents a linear system solved for the˜ l unknowns, with an (N x N) system matrix that is sparse, symmetric and positive-definite. Over the boundary faces of σ , we assign Dirichlet BCs for˜ l according to the prescribed boundary values. The diagonal and off-diagonal coefficients are and the i-th coefficient of the source term vector is We solve system (74)-(75) as the previous ones in sections 3.2, 3.3, or 4.2 and 4.3.
The forces over the bodies are easily computed as, where N bf is the number of triangular faces of the body, σ b and n b are the area of the b-th body face and the unit normal vector coming into the body, respectively,˜ f b is the value of the kinematic pressure at f b -th node of the body face, g b is the velocity vector assigned to the face, u k+1 e is the velocity vector in the tetrahedron e with the b-th body face, and d e j is the distance between the circumcenter of tetrahedron e and the circumcenter of the body face, computed as in Equation (33).

Test cases
The proposed numerical solver was applied to four different numerical tests. In the first test we analyzed the spatial and temporal accuracy of the model, as well as the required computational costs. In tests 2 and 3 we present two well-known literature applications, the lid driven cavity and the flow past a fixed sphere, according to different values of the Reynolds number. In the last test the real case of hemodynamic blood flow inside an abdominal aorta affected by an aneurysm is solved by the model inside a very irregular boundary.
For post-processing of the model outputs, we used the open source program Paraview. 4

Taylor-Green vortices test
We first analyzed the accuracy of the proposed solver by comparison of the computed results with the known analytical solution of the Taylor-Green vortex test (Taylor & Green, 1937). The velocity vector components and the pressure field are (Taylor & Green, 1937), u = cos(αx) sin(αy)e −βt v = − sin(αx) cos(αy)e −βt w = 0 such that the r.h.s. of Equation (1) where ∇ 0 e is computed from Equation (77) in the center of mass of tetrahedron e.
We discretize the domain using 5 meshes with mean element characteristic size h l (l = 1, . . . , 5) ranging from 0.00996 m to 0.066 m. h l is the mean value of the length of the sides of the tetrahedrons.
The discretized ICs u k e = u e,0 do not satisfy the momentum and continuity equations of each element at time t = 0, as u e,0 ∈ P 0,e , but the flux continuity through each face is missing. In order to circumvent this problem, before the beginning of the transient flow simulation with the BCs given by Equation (77), we compute the steadystate asymptotic solution corresponding to the BCs at t = 0 and we use it as ICs of the transient problem. We assume that the steady-state solution is attained when the L 2 norm of the relative scatters of the computed u, v, w, and ψ, compared with the ones of the previous iteration, are small and the following tolerance holds See in Tables 1 and 2 the L 2 and L ∞ norms of the relative errors of the steady-state x and y velocity components and kinematic pressure with respect to the values computed at the face circumcenters according to Equation (77), for meshes with different size. The error of the z velocity component is negligible compared to the x and y errors.
We also assume that the relative error err l , computed for the mesh with mean element size h l , is proportional to a power of h l , where r c,s is the spatial rate of convergence, obtained by comparing the relative errors of two sequential sizes h l and h l+1 as The rate of convergence is shown in Tables 1 and 2. The computed r c,s of the velocity components are slightly   greater than 1 (ranging from 1.14-1.22) due to the piecewise approximation of u and v inside each tetrahedron. The convergence rate obtained for the kinematic pressure is greater (ranging from 1.45-1.59) and the reason could be the nodal pressure distribution inside each tetrahedron, as described in section 5.2. In Figure 7 we plot the iso-contour lines, over a horizontal plane with z = 0.125 m, of the relative errors of the norm of u and of ψ obtained for the coarsest mesh. Close to the lateral boundary walls, the errors of the velocity decrease, due to the imposed BCs. On the opposite, the highest values of the relative errors of the kinematic pressure are close to the boundary lateral walls, since the Dirichlet BCs of ψ have been imposed over the upper and lower horizontal walls.
In Figure 8(a) we plot the norms of the relative errors, along with the 1st order convergence line.
With this test we also analyzed the time convergence rate of the algorithm. In order to cancel out the error due to the spatial discretization, we assumed as the reference solution the one obtained over the finest mesh (with mean size h l = 0.00996 m), and a time step size t = 0.001 s. At simulation time 2 s, we compared with the reference solution the numerical solutions obtained over the same mesh, assuming five different t values, ranging from 0.0015 s to 0.02 s. The L 2 and L ∞ norms of the relative errors are shown in Tables 3 and 4, along with the time rate of convergence, r c,t , computed by comparing the relative errors of two sequential time step sizes t l and t l+1 The rate of convergence r c,t is always greater than 1, ranging from 1.21-1.28, even if the model is 1st order accurate in time. This could be due to a twofold reason related to the MAST-PS, (1) the use of the internal time sub-grid during the ODEs solution, and (2) a polynomial time approximation order of the leaving momentum fluxes, using n G = 3 Gauss points. In Figure 8(b) we plot the norms of the relative errors along with the 1st-order convergence line.
For the finest mesh, for each time step adopted for the time convergence rate analysis, we computed the maximum CFL number, as CFL max = max t 3 √ W e ||u e || e = 1, . . . , N T (81) CFL max ranges from 2.56 (for t = 0.02) to 0.0128 (for the reference solution). We also investigated the computational (CPU) times required by the different model steps, using a single Intel Core i7 at 3.49 GHz. Because computational times strongly depends on the adopted computer and on the specific algorithm coding, we focused on the correlation existing between the the computational time of the single step and the number of elements. We set the average CPU time per iteration and per model step equal to and we assumed that a single step is efficiently solved as much as the β power exponent in the correlation 82(a) is small and close to 1. Equation (82a) in logarithmic space becomes ln (CPU step See in Figure 9 the CPU step time required for the solution of the single model steps, i.e. cell sorting (CPU S ), solution of the MAST-PS step (CPU MPS ), solution of the CS1 and CS2 steps (CPU CS1 and CPU CS2 , respectively), as well as the kinematic pressure computation (CPU ). MAST-PS is the most demanding one, but in this case the CPU is simply proportional to N T , and in Equation (82a) power β is equal to one. The CPU required by the other model steps grows in the logarithm space more than linearly with the number of tetrahedrons due to their 'non-explicit nature', since solution of large linear systems is involved, but β is smaller than 1.20 for the CS2 step, and smaller than 1.12 in all the other ones. The sorting cell operation is the least demanding algorithm step and its CPU time is 2-3 magnitude orders smaller than the MAST-PS one.

Lid driven cavity flow at different Reynolds numbers
In this test the flow is confined inside a square cavity, and it is driven by the upper wall displacement in horizontal direction. The set-up of the test is shown in Figure 10  where v max = 1 m/s. We run simulations for Re = 100, 400 and 1000.
The imposed horizontal velocity of the upper lid drives the fluid inside the cavity into a vortical flow. The resulting complex flow structure shows a large central vortex and small recirculating zones close to the cavity corners, whose shape depends on the value of Re.
We discretize the domain with 87,740 tetrahedrons and 17,388 nodes (see in Figure 10(b) an intersection of the mesh with a cutting plane), resulting in 84,385 clusters. The time step size t is set to 0.025 s, and the maximum computed CFL values are 2.18, 2.08 and 1.96 for Re = 100, 400 ad 1000, respectively. Figure 11. Test 2. Velocity streamlines (left panels), vorticity (ω z ) (central panels), iso (kinematic) pressure (right panels). Top Re = 100, middle Re = 400, bottom Re = 1000.
In Figure 11 we plot the streamlines, as well as the vorticity and the iso-pressure fields. The minimum and maximum pressures are computed respectively at the upper-left and upper-right corners, where a discontinuity in the boundary condition occurs. Observe that, because of this singularity, no ad-hoc handling is required in the model, unlike what is found in Botella and Peyret (1998), Boppana and Gajjar (2010), Kuhlmann and Romanò (2019), and cited references. The minimum pressure values are associated with the foci of the vortices, due to the high centrifugal acceleration occurring around them. We obtain good agreement with the literature results (e.g. Dalal et al., 2008 and cited references), and the results provided by the present solver match very well the ones of the 2D benchmark solutions given by Ghia et al. (1982) and shown in Figure 12. The results in Figures 11 and 12 refer to the cutting plane (x-z) with y = 0.1 m (i.e. the diametral plane of the domain). The results obtained for other cutting planes (x-z) are very similar to the previous ones, and for brevity are not shown here.
Observe that the value of the vertical velocity component computed for Re = 400 and x = 0.9063 provided by Ghia et al. (1982) (w Ghia = −0.23827) (see table 2 of the referred paper) is quite different from the result obtained by the present solver. On the other hand, the w Ghia result is missing in most of the papers where the lid-driven test is used as bench mark, including the papers reporting the solution by Ghia et al. (1982) for many other points (e.g. Xue & Burton, 2013 and many others), whereas in other papers the mentioned w Ghia vertical velocity component does not match the result, like in Dalal et al. (2008).

Flow past a stationary sphere at different Reynolds numbers
In the past few decades, a plethora of experimental, theoretical and numerical studies of viscous flow past a stationary sphere S have been presented to investigate wake structures. The Reynolds number, defined as based on the uniform undisturbed flow velocity U 0 , on the diameter of the sphere D s , and on the surrounding fluid viscosity ν, is used as a parameter to classify the wake structure (e.g. Johnson & Patel, 1999;Ploumhans et al., 2002;Sakamoto & Haniu, 1990, and cited references). The wake structure has been a strongly debated topic, and several controversial findings have been obtained by authors in the literature studies (e.g. Johnson & Patel, 1999;Ploumhans et al., 2002;Sakamoto & Haniu, 1990, and cited references). According to experimental and numerical studies, above 270 < Re < 290, the flow becomes unsteady but periodic, and vortex shedding starts around Re = 300 (e.g. Johnson & Patel, 1999;   Ploumhans et al., 2002;Sakamoto & Haniu, 1990, and cited references), with formation of hairpin vortices (as shown in Figure 13, from the experimental studies by Sakamoto and Haniu (1990)). By progressively increasing Re, the vortices start to intertwine with each other, and above Re = 500 periodicity is lost (e.g. Johnson & Patel, 1999;Ploumhans et al., 2002;Sakamoto & Haniu, 1990, and cited references). For a more comprehensive review, we refer the readers to the cited works.
In the research referred to here we investigated the flow structures around a stationary sphere for Re = 300 and 600. The fluid is assumed to be water. In Figure  14 we plot the 3D view of the domain and the setup of the numerical test. We assume the sphere S, with D s = 0.0254 m, symmetrically placed inside a large cylinder C with diameter D C = 0.3 m ( 12 D s ) and length 1.1 m ( 43.5 D s ). The center of S is located 0.15 m ( 6 D s ) downstream of the inflow upstream face of C. At the upstream inflow section of C we set u uniform and constant in time (U 0 , 0, 0), and the same velocity is imposed at the lateral walls of C in such a way that the flow around S is only weakly affected by the walls. Zero pressure is assumed at the downstream outflow section. Over the surface of S we assume no-slip BC. We assume flow at rest and zero pressure inside C at t = 0.
The mesh size is refined in zone I (around S and downstream of it, as shown in Figure 14(a)) in order to reproduce the strong velocity gradients close to the surface of S and the fluid vortices in the wake. A larger mesh size is adopted for the rest of the domain (zone II in Figure 14(a)). The mesh size (defined as in test 1) used for zones I and II is 0.00055 and 0.0128 m, respectively, with a smooth transition between the two zones. The total number of tetrahedrons in the mesh is 2,309,771, with 2,243,199 clusters and 392,850 nodes, and the surface of S is discretized with 4332 triangles.
Unfortunately, description of initiation of vortex shedding is most often ignored in numerical studies. In physical experiments, initiation of vortex shedding is generated by flow instabilities amplifying small flow disturbances (Sungsu, 2000). Such disturbances include, among others, asymmetric domain geometry, vibrations of the pipe, non-uniformity and turbulence of the inflow velocity, non-uniform roughness of the pipe walls and sphere surface, . . . (Sungsu, 2000). All these sources are missing in numerical experiments, and, for a stationary sphere inside a symmetric domain, symmetric steady solution are attained even for Reynolds numbers at which experimental unsteady vortex shedding has been detected (Sungsu, 2000). In numerical experiments, vortex shedding could be generated by (1) computational truncation and round-off errors, strongly dependent on the characteristics of the numerical solver and the computer, or (2) specifically introduced numerical perturbations (Sungsu, 2000). Examples of such numerical perturbations can be found, for example, in Ploumhans et al. (2002), Sungsu (2000) and cited references.
We performed a first series of simulations at Re = 300 without numerical perturbations, with an impulsive start of the inflow velocity. After rapid changes during the early stages of the process, the flow characteristics converted to a stationary solution. In Figure 15 we plot the streamlines of the stationary flow field in the (x-y) plane. After the early transient process due to the impulsive flow start, we computed the stationary values of drag and lift coefficients C D and C L listed in Table 5. These were obtained as C D = F x /(1/2ρU 0 πD 2 S /4) and where F x and F y are the x and y components of the total force, sum of the pressure and the viscous forces. The streamlines in the (x-z) plane are almost symmetrical, and the mean-intime value of the side coefficient C S , obtained as C S = F z /(1/2ρU 0 π D 2 S /4), is 1.34e-05. It is important to underline that C D and C L are in very good agreement with the mean in time values provided by literature studies, which are reported in Table 5.
Similarly to Ploumhans et al. (2002), the most efficient method for the present solver to trigger vortices has been, after the impulsive flow start, to set the y velocity component in the interval 3 ≤ τ * ≤ 4, along the inflow section and the lateral walls of C. The time step size used for the simulations is 0.05 s and the maximum computed value of the CFL number is 3.5.
In Figure 16 we plot the 3D view of the vorticity structures identified by the Q-criterium (Hunt et al., 1988). In Figures 17 and 18 we show the velocity streamlines and the iso-contour lines of the kinematic pressure in the (x-y) and (x-z) planes, respectively. The time difference among the panels is one quarter of a period, and after the fourth panel (3/4 of period), the cycle repeats again.   The present model satisfactorily reproduces the hairpin shapes of the vortical structure. Observe that, due to the strong pressure gradients feeding the movement of the vortices, the pressure minima in this case do not match their foci. After a rapid transient phase, due to the impulsive start and to the imposed numerical perturbation, the time evolution of the C D and C L coefficients becomes periodic in time. The mean and amplitude values are listed in Table 5, and compared with the results provided by other literature works.
We also simulate the case with Re = 600. According to experimental observations, the shedding vortices become irregular for Re > 480 (Sakamoto & Haniu, 1990). In Figure 13(c) we have previously shown the pattern of vortex shedding for 480 < Re < 800 (from Sakamoto & Haniu, 1990). The setting of the ICs and BCs is the same as for Re = 300, as is the impulsive flow start. In this case, perturbation for generating vortex shedding was not necessary. The time step size used for the numerical runs was 0.025 s and the maximum value of the CFL number attained during the simulations was 3.12.
In Figure 19(a) we show the 3D vortical structure at τ * = 35, where τ * is defined in Equation (85) and the irregularity of the hairpin vortices and their intertwining with each other is evident, as experimentally observed by Sakamoto and Haniu (1990). In Figure 19(b) we plot the time histories of the drag lift and side coefficients. As expected, we lost the periodic trend of C D and C L observed for Re = 300, and the value of the side coefficient is of the same order as C L . As mentioned in section 3.1.1, for the case of Re = 300, we estimate N r,k and predict the corresponding computational time of parallelization of the MAST algorithm assuming different N p number of processors. The maximum N r,k is 3847. Neglecting the time and the other parallelization costs, the MAST solution time T MAST would be equal to where N iter is the number of time steps and N R,k is the number of ranks at iteration k. This time is of course larger than the time T min computed assuming all available processors working together, because N r,i can be smaller than N p and the ratio between T MAST and T min grows along with N p . See in Table 6 the ratios T MAST / T T and T MAST / T min computed in the simulation of test 3, with a mesh of 2,309,771 tetrahedrons, assuming a number of processors in the range 4-250, at the present time corresponding to small-medium workstations. You can observe that the ratio T MAST / T min attains a maximum value of 1.4 with   the maximum number of 250 processors. This means that parallelization should work very well with this type of very popular computers even with large size problems.

Simulation of blood flow inside aneurism
In this test, we simulated the hemodynamic flow conditions inside a real abdominal aorta affected by a large aneurysm without thrombus in the lumen. The computational domain was computed starting from the kinematic field of a real (44-year-old female) patientspecific aortic wall, obtained from the data recorded by an electrocardiogram-gated computer tomography angiography (CTA) during a stabilized cardiac cycle, as described in (Aricò et al., 2020 and cited references). Besides the CTA images, additional input data were the measurements, in a resting condition, of pressure (on the left arm) and aorta volumetric flow rate (in the carotid artery), during a stabilized cardiac cycle (Aricò et al., 2020). The cardiac cycle T c of the patient was 0.83333 s.
The real aortic segment was approximately 0.16 m long (see Figure 20(a)). The computational domain was extended with respect to the real one by means of two transition stretches, and the real cross-sections were linearly morphed into circles of equivalent radii r = √ A/π (Figure 20(a)). The diameters of the inflow and outflow artificial sections, D i and D o , are 0.03213 and 0.0256 m, respectively, and the two stretches were approximately D i long.
For the numerical model simulations, we set an inflow velocity profile and a uniform spatial pressure distribution along the upstream and downstream sections of the computational domain, respectively. The BCs of the present model were obtained as described in Aricò et al. (2020) and cited references.   In Figure 21(a) we plot the waveforms of the inflow velocity (obtained by dividing the waveform of the flow rate by the area of the inflow cross-section), and the outlet pressure. The 'time of the diastolic (systolic) pressure' is the time corresponding to the minimal (maximal) aortic pressuretdp (tsp) in Figure 21. tdp and tsp were computed to be 0.0589 and 0.2946 s after the start of the cycle (Nagy et al., 2015).
A Womersley velocity profile of the pulsatile flow (Womersley, 1955) was analytically computed for the diameter D i , as described in Aricò et al. (2020), and assigned to the upstream boundary of the computational domain. The Womersley number α can be regarded as the ratio between the unsteady inertial forces and the viscous forces, and it is defined as (Womersley, 1955) where R is the radius of the vessel at the boundary section. We set ν = 3.77 × 10 −6 m 2 /s (blood kinematic viscosity). The original Poiseuille velocity profile is flattened proportionally to the α number. In the present case α is around 24, and Figure 21(b) shows the profiles computed for the significant times listed in the table of Figure 21(a).
The maximum value of the Reynolds number attained during the simulations was approximately 1200, which implies a fully laminar flow.
The computational domain corresponds to the fixedin-time geometry computed, at the tsp time, applying the procedure proposed in Nagy et al. (2015). The computational mesh had 790,346 tetrahedrons and 146,995 nodes, and the number of cluster is 729,038. The mesh size ranged from 1.e-04 m to 1e-03 m. In Figure 20(b,c) we show the mesh and a zoom of a cutting generic plane. The time step size was 0.01 s and the maximum value of the CFL number computed during the simulations was 3.45.
In Figure 22, for the significant times listed in Figure  21(b), we show the computed velocity and the kinematic pressure fields. The black arrow indicates the main upstream-downstream flow direction. At tdp, the pressure gradient is oriented according to the main upstream-downstream direction. The velocity profile in the upstream portion of the studied reach is almost uniform along the radial vessel direction, and, close to the walls, recirculating flow zones arise (see the zoom in Figure 23). These recirculating flows could be generated by the inflow velocity computed in the most lateral part  of the Womersley profile (see Figure 21(b)). The blood flow decelerates in the central region of the aorta, due to the enlargement of the vessel because of the aneurysm, and accelerates downstream of the aneurysm, due to the reduction of the section of the vessel. Due to the reduction of the inflow velocity assigned at the inflow section, from time 0.25 T c to tsp, and the corresponding increase of the assigned outlet pressure in the same time interval (see Figure 21), the pressure gradient along the principal flow direction changes sign, becoming downstreamupstream oriented. The recirculating flow velocity zones close to the lateral walls then disappear, since the boundary Womersley velocities assigned at the inflow section are inward oriented. At time 0.54 T c , the flow recirculation close to the vessel walls is stronger than at tdp. The size of the portion of the inflow section where the leaving Womersley velocities are set is similar to the one where the incoming velocities are assigned, and their norm is comparable to (or greater than) the values of the inflow velocity norm (see Figure 21(b)). Vorticities also arise in the central and downstream portions of the aortic vessel, as shown in the zoom of the velocity vector ( Figure 22).
In Figure 24 we plot the values of the velocity components computed along the axis of the aneurism, shown in Figure 25, at four different times. The origin of the coordinates along the axis is also shown in Figure 25. These results are marked as 'u (v, w) m1'. In Figure 24, we also superimpose the results obtained using a refined mesh, with 3,361,925 tetrahedrons, 599,988 nodes, and 3,116,998 clusters, marked as 'u (v, w) m2'. The scatters between the two solutions are almost negligible.

Conclusions
A new algorithm for the numerical solution of the 3D Navier-Stokes equations for incompressible fluids has been presented and validated with synthetic tests. The algorithm is radically new and is based on the The general strategy is to compute the mth tetrahedron of IORD and its rank RANK(IORD(m)) after the computation of the previous tetrahedrons IORD(1), . . . , IORD (m-1) and their rank. Order adopts the following subroutines, where apex i marks input variables and apex o marks output variables:

A1.1. Subroutine Switch
Input: nx, AUX, IORD, INV, FL, ET, RANK Output: RANK, IORD, INV Given the known index r = AUX(nx) of a tetrahedron which satisfies constraints A1 and has RANK i (r) = 0, compute the new rank of r as the maximum rank of the neighbor tetrahedrons that satisfy constraint (A1), plus one. This allows tetrahedron r to satisfy also requirement A2. Switch the position of tetrahedrons r and IORD i (m) in the

A1.2. Subroutine Search
Input: nx, AUX, IORD, INV, FL, ET, RANK, BACK, JC Output: nx, AUX, BACK, RANK, IORD, INV, JC Call BACK(k) the neighbor of tetrahedron k with RANK(k) = 0 and RANK(BACK(k)) = 0 with the maximum flux going from to BACK(k) to k. Given a length nx i ≥ 1, check if constraint (A1) is satisfied for k = AUX(nx). If it is satisfied, apply Switch and set nx o = nx i -1. Otherwise, select the neighbor tetrahedron kp with the minimum (maximum absolute value) entering flux smaller than zero and with RANK i (kp) = 0. If BACK(kp) = 0, we have a loop. In this case apply subroutine Cut and iterate the check. If constraint (A1) is not satisfied, select the neighbor tetrahedron kp with RANK(kp) = 0 and minimum entering flux, set BACK(k) = kp, update nx o with

A1.3. Subroutine Cut
Input: nx, AUX, BACK, JC, FL, ET Output: nx, BACK, JC If a loop is found, compute the index mb and the tetrahedron kb = AUX i (mb) corresponding to the minimum positive flux going from BACK (AUX i (m)) to AUX i (m) for mp ≤ m ≤ nx i , where AUX i (mp) = BACK (AUX i (nx i )). Set nx o = mb and BACK o (AUX i (m)) = 0, for mb ≤ m ≤ nx i . Set ka = BACK0 i (kb), JC o (n1,ka) = kb and JC o (n2,kb) = ka, where n1 and n2 are the local indices of the face common to tetrahedrons ka and kb. See the flow-chart in Figure A3.
In Order we loop the index p from 1 to nel. At each iteration, if nx > 0 we apply Search. If nx = 0, we test the tetrahedron k = IORD(p). If constraint (A1) is satisfied, we apply Switch. If it is not satisfied, we set nx = 1, AUX(nx) = k and apply Search. See the flow-chart in Figure A4.
Observe that, if zero loops and zero flux sign changes occur, the solution obtained in the MAST procedure is the same with any adopted sorting rule providing a sequence vector IORD also different from the vector computed in the Order subroutine, if constraint (A3) is satisfied for all the tetrahedrons. Due to loop cuts and flux sign changes, this is not true and a much better solution turns out to be the one obtained by solving sequentially all the tetrahedrons with the same rank, starting from 1, also using parallel computing if available. Any open source subroutine, like QUICKSORT, can be used to order all the tetrahedrons according to their rank value after the RANK solution of Order is found.
See the flow-chart of the Order, Search, Cut and Switch subroutines in Figures A1-A4.