High-Order Moment-Encoded Kinetic Simulation of Turbulent Flows

Kinetic solvers for incompressible fluid simulation were designed to run efficiently on massively parallel architectures such as GPUs. While these lattice Boltzmann solvers have recently proven much faster and more accurate than the macroscopic Navier-Stokes-based solvers traditionally used in graphics, it systematically comes at the price of a very large memory requirement: a mesoscopic discretization of statistical mechanics requires over an order of magnitude more variables per grid node than most fluid solvers in graphics. In order to open up kinetic simulation to gaming and simulation software packages on commodity hardware, we propose a HighOrder Moment-Encoded Lattice-Boltzmann-Method solver which we coined HOME-LBM, requiring only the storage of a few moments per grid node, with little to no loss of accuracy in the typical simulation scenarios encountered in graphics. We show that our lightweight and lightspeed fluid solver requires three times less memory and runs ten times faster than state-of-the-art kinetic solvers, for a nearly-identical visual output.

We contribute a new kinetic solver for incompressible and single-phase fluid simulation interacting with complex solids.By relying on only three moments per grid node instead of the usual 27-direction velocity discretization, we reduce memory usage threefold and improve efficiency by an order of magnitude, with limited impact on accuracy.As a consequence, a high-quality visual simulation of a tractor in a wind tunnel (with passively-advected colored particles to visualize the flow) can be achieved efficiently on a commodity GPU.
Kinetic solvers for incompressible fluid simulation were designed to run efficiently on massively parallel architectures such as GPUs.While these lattice Boltzmann solvers have recently proven much faster and more accurate than the macroscopic Navier-Stokes-based solvers traditionally used in graphics, it systematically comes at the price of a very large memory requirement: a mesoscopic discretization of statistical mechanics requires over an order of magnitude more variables per grid node than most fluid solvers in graphics.In order to open up kinetic simulation to gaming and simulation software packages on commodity hardware, we propose a High-Order Moment-Encoded Lattice-Boltzmann-Method solver which we coined HOME-LBM, requiring only the storage of a few moments per grid node, with little to no loss of accuracy in the typical simulation scenarios encountered in graphics.We show that our lightweight and lightspeed fluid solver requires three times less memory and runs ten times faster than state-of-the-art kinetic solvers, for a nearly-identical visual output.

INTRODUCTION
Lattice Boltzmann method (LBM) has become increasingly popular for the visual simulation of turbulent incompressible flows [Li et al. 2003;Thürey and Rüde 2009;Li et al. 2020;Lyu et al. 2021;Li et al. 2022].Contrary to most early fluid solvers used in computer graphics [Foster and Metaxas 1996;Stam 1999;Mullen et al. 2009;Lentine et al. 2010] which directly seek to integrate the Navier-Stokes equations, LBM is based instead on a kinetic formulation of the flow derived from statistical mechanics: a mesoscopic description of the fluid in time, space, and velocity components is advanced in time through linear streaming (to handle advection) followed by collision (performing a local relaxation towards kinetic equilibrium).The massively-parallel nature of LBM has long been its most attractive feature, as it naturally runs on modern GPUs [Li et al. 2003;Chen et al. 2022] to offer unparalleled computational efficiency compared to conventional solvers.Moreover, the development of accurate collision models has recently led to more accurate results with less dissipation and dispersion artifacts for complex flow behaviors [Li et al. 2020], in particular for low-viscosity turbulent fluids.
However, all LBM methods in graphics face a steep memory requirement: for a 27-direction velocity discretization as typically needed for simulations of high Reynolds numbers in 3D, LBM solvers often require at the very least 58 values per node (two sets of 27direction distribution values for time integration and the first two macroscopic moments (density and linear momentum, plus temporary memory usage for conversion to moments), which prevents the use of LBM on small GPU platforms.Although recent methods using efficient swap operations can save 27 variables per node through a more involved streaming process, the stubbornly-high memory size for large simulations not only demands high-end GPUs, but also hampers efficiency by requiring intensive memory access.
In this paper, we introduce a lightweight LBM solver which significantly cuts down on memory size and improves efficiency compared to previous methods, without incurring a noticeable drop in accuracy for typical visual flow simulations.Our approach borrows from recent advances on moment-representation LBM (MR-LBM) to bypass the use of a fine directional velocity sampling, while also remedying their compressibility issues at high Reynolds numbers due to their low-order collision model.In addition, we also propose a new moment-based boundary treatment for fast two-way coupling.We demonstrate that the resulting lightspeed and lightweight LBM solver ends up being three times smaller in memory footprint and an order of magnitude faster than [Li et al. 2020] for the same resolution and a nearly-identical visual output.

RELATED WORK AND MOTIVATION
While fluid simulation with fluid-solid coupling has been achieved in both computer graphics (CG) and computational fluid dynamics (CFD) in a variety of ways, we briefly review the most related works to our contributions to single-phase flow simulation.
The ability to support fluid-solid coupling is also crucial to simulate complex fluid-solid interactions.Early coupling works included voxelized boundary approximations [Takahashi et al. 2002;Génevaux et al. 2003;Robinson-Mosher et al. 2008;Azevedo et al. 2016], a rigid-fluid approach [Carlson et al. 2004], a fully-Eulerian method [Teng et al. 2016], or even coupling that can treat infinitesimally thin and deformable solids [Guendelman et al. 2005] by enforcing proper solid-boundary conditions to keep the fluid incompressible and get the correct interaction force between fluid and solids.Cut-cell-based approaches [Roble et al. 2005;Batty et al. 2007;Ng et al. 2009;Gibou and Min 2012;Weber et al. 2015;Edwards and Bridson 2014;Azevedo et al. 2016;Tao et al. 2022] have also been formulated to efficiently deal with thin structures and complex geometry by subdividing them into boundary-conforming regions.However, none of these methods have demonstrated stable coupling in very turbulent flows.Particle-based solvers [Colagrossi and Landrini 2003;Solenthaler and Pajarola 2008;Akinci et al. 2012;Schechter and Bridson 2012;de Goes et al. 2015;Bender and Koschier 2016;Band et al. 2017;Koschier and Bender 2017;Becker et al. 2009;Ihmsen et al. 2013;Fang et al. 2020] typically approximate solid boundaries with a dense set of particles, but often face pressure inconsistencies [Band et al. 2018], explaining why hybrid grid-particle solvers [Zhang et al. 2016;Fei et al. 2018;Hu et al. 2018;Fei et al. 2021] usually treat boundary conditions via grid-based methods.

Boltzmann-based fluid solvers
LBM has recently been proven a great alternative to traditional incompressible Navier-Stokes based solvers for both single [Li et al. 2020;Lyu et al. 2021] and multiphase [Li et al. 2021[Li et al. , 2022;;Li and Desbrun 2023] fluid simulation.While early works [Li et al. 2003;Thürey and Rüde 2009] were based on the low-order BGK collision model [Chen and Doolen 1998], LBM solvers in graphics significantly improved their accuracy for turbulent flows by constructing higher-order collision models.For instance, Li et al. [2020] proposed a central-moment multiple-relaxation-time (MRT) collision model where non-physical rates were optimized for reduced dispersion and dissipation.Moreover, in order to improve computational efficiency on GPU [Li et al. 2003], Chen et al. [2022] and Lehmann [2022] provided a number of detailed strategies to enable improved parallelism.As a result, current methods have reached impressive capabilities ACM Trans. Graph.,Vol. 42,No. 6  to simulate both laminar and turbulent flows, with both numerical accuracy and efficiency surpassing even the most recent Navier-Stokes solvers.Fluid/structure coupling was also tackled in previous LBM works, using the immersed boundary method (IBM [Peskin 1972]) or the moving bounce-back boundary method (BBM [Ladd 1994]); Lyu et al. [2021] also suggested a velocity correction to allow for thin obstacles and suppress the usual ringing artifacts of BBM.

Motivation and overview
The recent upending of Navier-Stokes fluid solvers due to the emergence of high-order accurate LBM solvers has not yet permeated the realm of video games or simulation software packages using commodity hardware, because all current kinetic methods face a huge memory cost: since a D3Q27 model (i.e., a discretization of velocities in 27 directions) is needed to reliably handle turbulent flows, a typical implementation requires a minimum of 58 distribution values per grid node [Li et al. 2020], including two copies of the 27-direction distribution values for time integration and the important macroscopic quantities (density and velocity).While various streaming strategies can remove up to 27 variables per node [Latt 2007], this still is eight to ten times more than a typical Navier-Stokes solver, requiring high-end GPUs for detailed simulations with fine grids.Very recently, Vardhan et al. [2019] and Ferrari et al. [2023] proposed encoding a 27-direction distribution via only its few first moments in order to replace the memory-hungry distribution-based representation.This moment-representation LBM (MR-LBM) only uses 10 variables per node, greatly reducing memory size requirements and inducing a performance uplift between 25% and 40% on GPU due to far reduced memory access.However, this improvement in efficiency and memory size comes with a severe limitation: this lightweight approach can no longer support turbulent flows ( ≥ 4000) reliably, due to limited accuracy.
This paper leverages the moment-based representation of LBM initially proposed in [Vardhan et al. 2019;Ferrari et al. 2023], while addressing its current limitations.We show that one can use a lightweight moment representation to replace the large storage of direction-based distributions used in current LBM methods in graphics while enforcing a high-order accurate collision at low computational cost.We also formulate a new fluid-solid coupling scheme based on this moment-based LBM approach to generate complex two-way phenomena.Reduced memory usage along with fast moment-based collision and coupling result in a lightweight and lightspeed kinetic solver with limited accuracy loss compared to state-of-the-art LBM solvers.

BACKGROUND
Before introducing our new moment-based LBM scheme, we briefly review current single-phase LBM methods in graphics, and introduce existing moment-based representation LBM methods, while pointing out their stability and accuracy issues.

Lattice Boltzmann method at a glance
Continuous Boltzmann equation.In statistical mechanics, fluid dynamics is described by the time evolution of a mesoscopic distribution function  (, , ) representing the probability of particle being at position  at time  and with velocity .This is in sharp contrast to typical NS-based method modeling the flow dynamics with the time evolution of a macroscopic velocity  (, ).In the context of single-phase fluid dynamics, the governing kinetic equation for the evolution of the distribution function is known as the Boltzmann equation [Shan et al. 2006]: where  represents external forces and Ω is the collision term which relaxes the distribution function towards the local thermodynamic equilibrium state   [Coreixas et al. 2017] defined as where  is a gas constant,  is the thermodynamic temperature and  is the spatial dimension (2, or 3).The macroscopic quantities such as density , linear momentum , and the rank-two tensor  related to the stress tensor can be recovered from the distribution through: where the greek indices  and  refer to tensor coordinates, i.e.,  = {   } , for ,  ∈ {,,}.
The original lattice Boltzmann method (LBM) used the Bhatnagar-Gross-Krook (BGK) collision model Ω( ) = −( −  eq )/ (where  is the relaxation time determining how fast the equilibrium is being reached, thus related to the kinematic viscosity  via  = 3 + 1 2 ), which preserves density and first-order momentum.Even with this simplistic collision model, Eq. ( 1) was proven to recover Navier-Stokes equation macroscopically [Shan et al. 2006].
Lattice Boltzmann equations.With time  discretized through regular timesteps, space  discretized through a regular grid, and the mesoscopic velocity  discretized at each grid node through a lattice structure with  directions as shown in Fig. 4, one can turn the continuous Eq. (1) into lattice Boltzmann equations (LBE) in dimensionless units [Li et al. 2020] yielding: where   (, ) encodes the distribution  in the -th direction at position  and time ,   is the discrete lattice velocity in the th direction (Fig. 4), Ω  is the discretized collision operator, and   results from external forces projected on distribution space.Through operator splitting, Eqs. ( 4) can be solved in two steps: first, the distribution values are advected via streaming by computing  *  (, ) =   ( −   , ) , (5) followed by a collision step expressed as Macroscopic quantities are then updated using discretized versions of Eqs. ( 3), which, when incorporating external forces, read: Collision models.While the BGK collision model described above makes an LBM solver provably equivalent to the Navier-Stokes equations for a fine enough discretization, its first-order nature was recognized as a major limitation for a fixed grid size.In recent years, many involved collision models have been formulated which offer higher-order accuracy.In particular, the non-orthogonal centralmoment multiple-relaxation-time model (NOCM-MRT [De Rosis 2017;De Rosis and Luo 2019]) and the cumulant-based model (CU-MRT [Geier et al. 2015[Geier et al. , 2017]]) have both been proven to be particularly accurate collision models.For instance, Li et al. [2020] and Lyu et al. [2021] leveraged a NOCM-MRT collision model where the distribution function is mapped onto a central moment space and each moment is relaxed at a different rate before reprojecting the post-collision moments back into a distribution function; a further optimization of higher-order (non-physical) relaxation rates can also be added to offer low-diffusion and low-dispersion simulations.Very recently, Lyu et al. [2023] show that a cumulant model, which performs collision by mapping onto a cumulant space this time, can provide even higher accuracy for very high Reynolds numbers if higher-order cumulants are optimized to maximize entropy.

Towards lightweight LBM
However, all single-phase turbulent fluid simulations in computer graphics thus far need to store two distribution functions for their temporal update and the first two macroscopic quantities from Eq. ( 7), leading to at least 58 variables per node -without considering extra temporary memory used, for instance, for the projection from distribution to moment spaces in the evaluation of the collision terms.While more complex in-place streaming methods (such as the swap approach of Latt [2007] or the esoteric twist of Geier and Schönherr [2017]) can remove a copy of the distribution function, typical memory requirements for high-resolution simulations rule out the use of consumer GPUs.
A moment representation for LBM (MR-LBM [Vardhan et al. 2019;Gounley et al. 2022;Ferrari et al. 2023]) has recently been formulated.It requires far fewer variables per node, thus relaxing memory requirements at least threefold compared to LBM solvers in graphics such as [Li et al. 2020], seemingly ensuring its applicability to consumer GPUs.The MR-LBM consists of three steps: (a) from the three stored moments , , , a distribution function is reconstructed via (we use Einstein notation to indicate tensor contraction): (  ) , (8) where   are lattice weights and } , represents the Hermite polynomials of order 2; (b) the reconstructed distribution function is then streamed using Eq. ( 5); (c) the resulting distribution is converted back to its three moments using Eqs.( 7), and the BGK collision model and external forces are incorporated (effectively replacing Eq. ( 6)) by simply updating the current moments through: (10) For simplicity, in this paper, we call the first step (a) the "distribution reconstruction" step, the second step (b) "regular streaming" while the third will be denoted by the "moment-based collision" step.

Discussion
While the MR-LBM uses about a third less memory than a traditional LBM and about 50% of the memory required by an in-place implementation, it cannot handle flows with Reynolds numbers above 4000.There are two reasons for this severe limitation.First, in the distribution reconstruction step, a second-order reconstruction is used, which introduces truncation errors to macroscopic moments for turbulent flows.Second, the moment-based collision step assumes a BGK collision model which is well known to be very limited as it is only first-order accurate, often bringing large errors into macroscopic moments.As a result, the gain in memory size comes with a stringent limitation on the viscosity of the fluid: one cannot produce the high Reynolds number flows that a regular LBM implementation is typically successful at generating.

HIGH-ORDER MOMENT-ENCODED LBM
We now introduce our High-Order Moment-Encoded LBM (HOME-LBM) approach which removes most of the shortcomings of MR-LBM by modifying its first and third steps to gain substantial accuracy and stability.In this section, we will assume a D3Q27 lattice discretization in 3D (resp., D2Q9 in 2D) as typically recommended for turbulent flows to ensure numerical accuracy.

Hermite polynomials expansion
Continuous expressions.As a reminder, distribution functions in LBM are traditionally expressed through Hermite series expansions [Shan et al. 2006]: where  is the order of approximation;  () is a weighting function defined as  () = exp(−∥ ∥ 2 /2)/(2) /2 ; the superscript "[]" indicates a rank- tensor; the operator ":" denotes full tensor contraction; and  [] is the Hermite polynomial of order , a -th order tensor defined as Since Hermite polynomials form an orthonormal basis (for the  2 inner product weighted by ), the coefficient  [] (, ) of a given distribution  can be computed via a weighted inner product: [] ()d .
Note that the first three orders of coefficients coincide, in fact, with the three first velocity moments from Eq. ( 3) since  [0] =  ,  [1] =  , and Discrete distributions.Now, when a lattice discretization of the local velocity  into  discrete microscopic velocities {  } −1 =0 ( = 9 in 2D,  = 27 in 3D, see Fig. 4) is used and using the lattice weights   from Eq. 8 as quadrature weights, the discrete distribution function corresponding to  (•, , ) are the  values {  } −1 =0 per node and per timestep defined as: Eq. ( 13) is then evaluated through Gauss-Hermite quadrature as: One can verify that the first three orders of coefficients  [0] ,  [1] , and  [2] corresponds to the discrete moments from Eqs. (7).

High-order distribution reconstruction
In the MR-LBM approach [Vardhan et al. 2019;Gounley et al. 2022;Ferrari et al. 2023], discrete distribution functions are not stored on the computational grid to avoid large memory footprint, but replaced by only the first three velocity moments , , and  at each grid node instead.From these moments, a second-order Hermite approximation of the distribution of the form of Eq. ( 11) that matches these velocity moments is derived by simply choosing only three non-zero coefficients ( [0] ,  [1] ,  [2] ) in the Hermite expansion and setting these coefficients to the three known moments (Eq.( 14)).This low-order reconstruction of the 27-direction values   leads to a very simple expression of the distribution function given in Eq. ( 8) -and hence, a very cheap reconstruction -but introduces significant truncation errors due to its particularly low-order in the case of turbulent flows, making it ill-adapted to our needs.Regularized distributions.Instead, we propose to leverage existing works to reconstruct a third-order Hermite approximation of the distribution which will improve the precision and numerical stability of our HOME-LBM.Latt and Chopard [2006] introduced the concept of regularized distribution for standard LBM, which stemmed from making sure that the BGK collision model leads to the correct macroscopic equations in the hydrodynamic limit, by filtering distribution functions after streaming and before collision to remove unwanted oscillations ("ghost modes") and instabilities in the simulation.From the current distribution  of a node, the first three velocity moments , , and  [2] are evaluated, and a particular choice of a second-order Hermite truncated series for the off-equlibrium part  off (i.e., the original distribution minus the equilibrium distribution  eq (, )) is chosen to ensure that the regularized distribution satisfies key symmetries.Malaspinas [2015] proposed another regularization, now relying on a full sixth-order Hermite series (in 3D), where the various coefficients  [>2] are recursively evaluated from , , and  [2] to remove the typical numerical inaccuracies of high-order moments.
Third-order Hermite reconstruction.Compared to the previous work mentioned above, we are not trying to filter an existing distribution, but to reconstruct one instead -but we can use their approach to turn our three velocity moments into a valid distribution function.While regularizations are often based on second-order Hermite truncations or full sixth-order hermite series, we opt for a third-order Hermite truncation to offer a good balance between accuracy and computational efficiency.Since the moment representation of LBM stores  (to which we set  [0] ),  (to which we set  [1] ), and  (to which we set  [2] ), we can use the recursive computation of third-order coefficients  [3] from Malaspinas [2015] as is for our reconstruction of a third-order Hermite approximation of the distribution.The advantages are obvious: due to their regularization procedure, the reconstructed distribution is void of significant ghost modes; but our truncation at the third term of the Hermite expansion offers improved computational efficiency, obviously.
Reconstruction in closed-form.While the recursive evaluation of  [>2] from [Malaspinas 2015] can be used, we can also express the resulting coefficients directly in closed form as a function of the first three velocity moments , , and , leading in 3D to the following third-order Hermite reconstruction expression for each of the distribution function   : We also provide the formula for the 2D reconstruction of a distribution from the three moments in App.B (see Eq. ( 29)), as well as all the expressions of the Hermite terms in App.A to ease implementation.

High-order collision model
The original MR-LBM relies on the BGK collision model, whose firstorder nature can only handle low Reynolds numbers.For turbulent flows, one needs a higher-order collision model instead.Multiple relaxation time (MRT) models, which convert the distribution function into a central-moment or cumulant space in which each component gets relaxed towards its equilibrium with an individual rate before being converted back, have been successfully used in graphics in recent years, be it for single-phase [Li et al. 2020;Lyu et al. 2021Lyu et al. , 2023] ] or multi-phase [Li et al. 2021[Li et al. , 2022;;Li and Desbrun 2023] flow simulation.We incorporate an MRT moment-based collision step in HOME-LBM too in order to ensure higher accuracy.
Central-moment collision in regular LBM.An MRT-based approach to collision, named NOCM-MRT, uses non-orthogonal central moments  = {  } =0..26 , which are derived from distribution functions  = {  } =0..26 via a linear transform  =  , where  is a matrix known in closed form as a function of the macroscopic velocity .By subtracting the equilibrium distribution projected into the same central moment space, one gets the off-equilibrium central moments  −  eq , which are relaxed by individual rates   , before the result is converted back to a distribution by multiplying it by  −1 , which provides the collision terms Ω  from Eq. ( 4) -see, for instance, [Li et al. 2020].If one also incorporates external forces, the NOCM-MRT collision is expressed as: where Ω Ω Ω comprises all collision terms Ω  ,  is the diagonal matrix containing all individual relaxation rates   , and  represents the force terms projected into central-moment space.
Collision in HOME-LBM.In order to improve efficiency, we propose an alternate evaluation based on our velocity moments, still affording a high-order collision evaluation.After streaming the reconstructed distribution, we directly compute the three new moment  * ,  *  * , and  *  * for each of the resulting distribution per node based on Eq. ( 7) before the collision step.(The use of the superscript asterisk indicates that these moments are temporary: they will be altered by the collision.)Knowing  * and  * on a grid node allows us to know the equilibrium distribution defined by Eq. ( 2).The resulting central-moment  eq =  eq and the external force  are then projected into the Hermite-based form of Eq. ( 11): following [Li et al. 2020], we use a sixth-order Hermite expansion here, leading to mostly zero terms except for: for the central moments, and, for the force-related terms, To evaluate the actual post-streaming distribution  from the updated velocity moments, we chose a third-order Hermite expansion to improve upon previous works (which all used a second-order one) while keeping the collision evaluation simple.Indeed, the resulting effect of the collision encoded by Eq. ( 18) ends up being quite straightforward: the post-collision velocity moments do not require the lengthy evaluation of (and multiplications by) matrices  and its inverse, as the closed-form expressions of the three velocity moments (found with Matlab [Matlab 2023]) are . Note that these expressions (except for the trivial Eqs. ( 21)-( 22) which just reflect density and linear momentum preservation of the collision process) are far simpler than having to deal with matrix  and its inverse in the usual LBM collision process: therefore, the moment-encoded LBM provides a great boost in efficiency due to these simple collision updates, see Fig. 12.For completeness, we provide the expressions for the 2D case in App. C.
Cumulant-based collision model.We can also proceed similarly to incorporate the cumulant-based collision model into our momentencoded solver; we provide the resulting closed-form expressions in App.D. Due to the more involved nature of the cumulant model (i.e., its non-linear projection), the update rules are three times as long.Therefore, all our results use the central-moment based collision (Eqs.(21-23)), except for Fig. 8 where we test both models.

Moment-based single-node coupling
Now that we have addressed how to integrate the lattice Boltzmann equations (Eqs.( 6) ) in our moment-encoded LBM context, the last issue to tackle is fluid-solid coupling.One of the simplest approaches to coupling is the moving bounce-back method [Ladd 1994], which alters the streaming process near a boundary by reversing the distribution function advection against the boundary and applying a momentum exchange based on the velocity of the solid encountered.However, its first-order nature often leads to spurious oscillations especially in turbulent flows, which spurred the introduction of a hybrid velocity-correction method [Lyu et al. 2021] to offer stable two-way coupling, at the cost of larger stencils.We favor a more direct and local approach that requires less computations by leveraging our high-order distribution reconstruction.Streaming.Streaming is performed on all unobstructed links without any change.Now if a link to a grid node  intersects a solid boundary (a case often called a "cut-cell link"), we alter streaming so that an approximated distribution function from the solid boundary is streamed to the grid node.Let  be the point of the intersected solid on the link (see inset).Our approach simply amounts to stream a reconstructed distribution value    to , as we detail next, so that any type of obstacle geometry (even thin shells) can be handled.
Solid distribution.Given our three-velocity-moment setup, we can approximate    easily.First, since we are simulating a homoegenous weakly-compressible fluid,   =   .As for the velocity   , we simply pick it based on the motion of the solid boundary at , which is known exactly based on the linear velocity and angular velocity of the solid.We thus just need to form an approximation of the rank-2 tensor   .We know that the equilibrium tensor given the known moments   and   would be  ( This approximation is in fact tantamount to assuming ∇|  = ∇|  since we know that the off-equilibrium part of  is proportional to the local stress tensor [Malaspinas 2015].While this may be too coarse of an approximation for very high Reynolds number (generating thin, fast varying boundary layers), we found it to be sufficient for graphics purposes -and better approximations could be derived from neighboring values of nodes on the same side of the obstacle, albeit at a higher computational cost.Now that we have the three velocity moments at , we evaluate the outcoming value    using Eq. ( 17) and stream it to .Force on solids.In kinetic theory, momentum exchange is used to calculate the resulting force on a solid when fluid particles hit it, expressed in a Galilean-invariant way [Peng et al. 2016] as: where {in,out} refer to the velocity index of the cut link and its opposite.One can thus evaluate the impact of the fluid motion onto a solid by summing all the contributions for all nodes  whose links ℓ  intersect the solid to get the force   and torque   through: where  ′ is the opposite direction to , and we use () to denote the solid point  on the link associated to grid node , and   to denote the barycenter of the solid.As force and torque expressions are in LBM space, we further map them to physical space before sending them to the rigid-body integrator, as in [Li et al. 2020].

RESULTS
We now evaluate of our HOME-LBM approach by discussing a few implementation details first, then presenting a number of tests, simulations, and comparisons to previous work.All results were run on a workstation (AMD Ryzen 9 7900X3D 12-core processor) equipped with an NVIDIA GeForce RTX 4090 with 24GB of GPU memory.We also provide all relevant statistics in Tab. 1, obtained with NVIDIA Nsight to profile kernel time.

Implementation details
We implemented our approach (see pseudocode in Alg. 1) in C++ and CUDA, using a structure-of-arrays (SOA) data structure to store 20 variables per grid node: 10 per grid node to store the velocity moments, with two copies to facilitate the time update.For our solid boundary treatment, we follow the approach of Li and Desbrun [2023] by first constructing a bounding volume hierarchy tree structure for the 3D mesh model on the GPU; for dynamic objects, we use cut-cell flags and bounding boxes to accelerate link-mesh intersection.Given the low peak memory and efficiency of our HOME-LBM scheme, turbulent flow simulation with static objects is achieved with only one GPU kernel pass which contains streaming, boundary treatment (link-mesh intersection) and collision.Though not limited to such a choice, we use 8×4×4 blocks and assign each node to one CUDA thread.To speed up the computation, we load the node data, save the reconstructed 27 distribution function values into shared memory, and fetch the data when required for the following step within the kernel.Since the streaming step requires accessing neighboring nodes, we also store one halo layer of nodes around the block in our shared memory to further accelerate GPU data access.For simulations involving dynamic objects, we use two kernel passes to improve performance, the additional one focusing on cut-cell flag update.When realtime performance and memory size are particularly pressing requirements, we also follow [Geier et al. 2017] to improve floating-point precision by replacing   with   −  , which slightly changes the computation of  (we need to add    = 1back).This trick, allowing for greater off-equilibrium components, lets us use 16-bit floats (instead of 32-bit floats) without stability or visual consequences; our realtime session from Fig. 14 was achieved with this variant, where we gained another factor two in memory and a speedup factor of 1.3 using 8×8×4 blocks this time.
Finally, we can advect particles along the macroscopic velocity field with a RK3 method [Ralston 1962] for visualization purposes.

Accuracy and efficiency evaluations
3D Taylor-Green vortex.We first test our solver's numerical accuracy using the three dimensional Taylor-Green vortex (TGV) case, as in [De Rosis and Luo 2019]  Flow past a sphere.We also evaluated our solver in Fig. 8 with different LBM-based solvers for the simulation of a flow past a spherical obstacle at a very high Reynolds number  = 800, 000 using a computational grid of 200×400×200.While the original MR-LBM [Ferrari et al. 2023] (a) crashes early, we see that the approach of [Li et al. 2020]  results of the full-LBM cumulant-based model well despite its only third-order nature, and does not exhibit ringing artifacts either.
Accuracy and robustness analysis.We also performed experiments to evaluate our solver's accuracy.We first test the accuracy of HOME-LBM using the 2D Taylor-Green vortex example (for which a closedform solution is known) proposed in [Zehnder et al. 2018] and used in the NOCM-MRT work of [Li et al. 2020].Compared to these two state-of-the-art Navier-Stokes vs. LBM solvers for different time steps and grid sizes, Fig. 9 demonstrates that our compression of the distribution to its first three velocity moments does not affect  the resulting accuracy of the solver as we almost systematically match the results of [Li et al. 2020], with roughly two orders of magnitude smaller velocity root-mean-square error than [Zehnder et al. 2018] -exemplifying the current superiority of LBM solvers.Note that this result using a D2Q9 discretization is likely to be better than the actual reconstruction error generated by HOME-LBM for a D3Q27 discretization, but it provides clear evidence of the advantage that our moment-encoded approach has over common Navier-Stokes solvers.We then use a neutral LBM simulation (we picked the NOCM-MRT simulation [Li et al. 2020] of the jet flow past a sphere from Fig. 8) to test the difference between a secondorder vs. a third-order Hermite reconstruction method (i.e., Eq. ( 8) vs. Eq.( 17)): for each frame of the animation, we pick all the distribution functions, compute their first three velocity-moments, then use either Eq. ( 8) or Eq. ( 17) to reconstruct them.When we measure the  2 error of the third-order moment from the reconstructions compared to the third-order moment of the original distributions, we see around one to two orders of improvement in the third-order Hermite reconstruction (Fig. 10).Finally, we prove in Fig. 11 that both our high-order distribution reconstruction and moment-based collision model are important: substituting one of these contributions with a lower-order approximation leads to blowups.
Performance comparison.To show our solver's efficiency, we compare the timing cost per iteration for an identical simulation, chosen to be the one from Fig. 8.We use the optimized version of NOCM-MRT from [Lyu et al. 2021] which employs LU decomposition and shared memory to accelerate the approach of [Li et al. 2020]   NOCM-MRT [Lyu et al. 2021], and our HOME-LBM), we compare the different average times per iteration of these solvers.HOME-LBM is on average 5 times more efficient than NOCM-MRT and 10 times more efficient than CU-LBM.Moreover, our smaller memory footprint allows us to reach fine grid sizes for which the two other methods run out of memory.In particular, our moment-based collision is a key factor in our timing gains.
same simulation for 2000 iterations, count the computational time using the clock() function, and deduce the mean time per iteration for each method.Results are shown in Fig. 12: HOME-LBM ends up about five times faster than NOCM-MRT (in particular because ours only needs one kernel pass and significantly less memory, which significantly reduces data read/write and boosts performance) and over an order of magnitude faster than cumulant collision model; moreover, on a GPU with 24GB memory, HOME-LBM can support resolutions of up to 500×1000×500, while the two others only go up to 350×700×350 before running out of memory.Note that in particular, our novel moment-based collision is an important factor in our timing gains since using the NOCM-MRT model from [Lyu et al. 2021] on the distribution function right after streaming instead ends up being four times slower.

Offline simulations
Now, we go over different test results that we ran to illustrate our solver ability.Note that we employed Bullet [Coumans and Bai 2021] as our rigid-body solver, and final results were rendered using the GPU-accelerated 3D renderer Redshift [Maxon 2023] 3D coupling with static objects.In Fig. 1, we demonstrate a flow past a detailed mesh of a tractor.Five different colored smoke injections are used to visualize the details of the flow and wake.In Fig. 2, an octopus mesh with intricate topology (as it contains several serpentine grooves) is placed in a turbulent wind field.Smoke flowing through the complex obstacle exhibits a variety of vortex details.Fig. 5 shows a complex wake created by a fern in the wind.
3D coupling with dynamic objects.Fig. 6 shows a rotating aircraft propeller in a wind field.In this one-way coupling example, the wind entrains the vortices generated by the propeller's dynamics to the right of the frame.In order to visualize the flow details, we insert four passive smoke injectors at the tips of the four blades.Fig. 15 shows the motion of a hair comb creating fine vortices.We also demonstrate two-way coupling examples.Fig. 13 shows leaves of different weights being dropped on top of a jet flow and falling under the action of gravity.Fig. 16 shows a rotor of 2kg and 3.75m of diameter, where a torque is applied to make it rotate at 19 rad/s.This makes it lift off, but as soon as the torque stops, it falls back onto the floor.Finally, in Fig. 17, two car models of widely different weights are dropped to the floor.Due to air resistance, the cars end up with different terminal speeds and hit the ground at different times, resulting in different effects on the surrounding smoke.

Real-time simulation
We also present a realtime session entirely run on an NVIDIA GeForce RTX 4090 GPU card in Fig. 14.With this single GPU, we were able to manipulate interactively the displacement of a rigid body (a bowl in this case) and run our HOME-LBM simulator with a 196×196×196 grid, as well as an in-house volume renderer.Note that graphics fluid solvers are also able to achieve realtime through the use of large time steps and/or multigrid pressure solves, but as argued in [Li et al. 2020], the visual quality and accuracy of LBM simulation (and in particular, its ability to handle nearly inviscid fluids) are significantly higher, thus resulting in improved realism.

CONCLUSION
In this paper we introduced HOME-LBM, a lightweight, yet lightspeed simulator of fluid turbulent flows and rigid body coupling based on high-order moment-encoded LBM solver.It offers a fresh perspective on the LBM method: while all previous graphics approaches using kinetic solvers had distribution functions as their main variables, we show that keeping only the first three velocity moments leads to a smaller memory footprint and a far improved computational efficiency.Compared to other computational fluid dynamics approaches using the same moments to encode the kinetic variables, we introduce a higher-order reconstruction of the distribution function for the streaming step, and show that the collision step can be done with a simple closed-form update rule of the moments, improving the efficiency of the solver while offering an accuracy close to the full-blown distribution-based solvers.
This new type of solver opens up a number of possible research directions.First, our third-order Hermite-based collision model leads to relatively simple update rules for the central moment version, but the cumulant-based version (which we provide in App.D) is slightly more involved; it would be useful to see if simpler rules could be found, still ensuring at least a third-order approximation.Note that if more moments are stored, one could also go up in order -but we believe that third-order offers a good compromise for graphics as it suffices for most typical scenarios of visual fluid simulation and offers particularly simple computations for the case of single-phase fluid simulation.Moreover, the efficiency and memory footprint that we now reach gives hope that one could treat compressible and thermal flows with a kinetic solver with significantly less memory than the current 3103 lattice structures used in compressible LBM works in CFD [Coreixas et al. 2017].It could require the addition of other macroscopic variables such as temperature terms, and maybe higher-order reconstruction and collision.Furthermore, extending our moment-based approach to LBM multiphase solvers would be an equally promising direction for future research.

ACKNOWLEDGMENTS
The tractor model in Fig. 1 is from grabcad, the octopus in Fig. 2 is from Thingi10k, the fern in Fig. 5 is from cgtrader, the leaf in Fig. 13 is from archive3d, the comb in Fig. 15 is from cgtrader, while the propeller in Figs. 6 and 16 is from free3d.Mathieu Desbrun acknowledges the generous support of Ansys and Adobe Research, as well as a Choose France Inria chair.[Lyu et al. 2021, Fig. 4], we simulate a hair comb containing sixty teeth moving around and creating fine vortices in its wake, properly capturing the complex fluid-solid interaction engendered by this finely-detailed geometry.

B 2D MOMENT-BASED RECONSTRUCTION
For the reader's convenience, we also provide the 2D reconstruction of a distribution from its three velocity moments , , and , in order to complement the 3D expression from Eq. ( 17):

D 3D CUMULANT-MOMENT-BASED COLLISION
Finally, we also provide the closed-form expressions one gets when using the cumulant-based approach from [Geier et al. 2017] in our moment-encoded context:    (,

Fig. 1 .
Fig. 1.Lightweight and lightspeed LBM smulation:We contribute a new kinetic solver for incompressible and single-phase fluid simulation interacting with complex solids.By relying on only three moments per grid node instead of the usual 27-direction velocity discretization, we reduce memory usage threefold and improve efficiency by an order of magnitude, with limited impact on accuracy.As a consequence, a high-quality visual simulation of a tractor in a wind tunnel (with passively-advected colored particles to visualize the flow) can be achieved efficiently on a commodity GPU.

Fig. 2 .
Fig. 2. Octopus: A complex object in the shape of an octopus with multiple serpentine grooves is placed in a turbulent wind field.A turbulent wake is formed.

Fig. 3 .
Fig. 3. Delta wing.The airflow over a thin-shell delta wing is simulated with HOME-LBM (left), exhibiting the stable spiral vortex structure near the leading edge of the wing seen in experiments [Délery 2001] (right).

Fig. 4 .
Fig. 4. Lattice structures: In LBM, distribution functions are often discretized on a D2Q9 lattice (left) or on a D3Q27 (right) in 3D, offering a discretization of both space via grid nodes and velocity via (orange, blue, and green) links which properly resolves turbulent flows.

Fig. 5 .
Fig. 5. Wind through a fern: A static 3D model of a fern is placed in a strong flow field, generating a complex wake (visualized with particles).

Fig. 6 .
Fig. 6.Propeller: A spinning propeller placed in a constant flow field going towards the right creates complex vortical patterns.In this simulation, passive colored particles are emitted at the tips of the four blades purely for visualization purposes.
we now need to add an approximation of its off-equilibrium part.We simply use the off-equilibrium moment of node , resulting in the expression:    =       +     −       .
Fig. 8. Comparisons.For a strong flow past a sphere, we run various LBM solvers based on different collision models: (a) MR-LBM [Ferrari et al. 2023] which blows up at an early frame, (b) original NOCM-MRT [De Rosis 2017], (c) optimized NOCM-MRT [Li et al. 2020], (d) CU-MRT [iRMB 2023], followed by HOME-LBM with cumulant-based (e) and central-moment-based (f) collisions.While our approach does not capture exactly the high-order cumulant collision model (e), both our options behave similarly, without the ringing artifacts of (c) or the clear oversmoothing of (b) as exhibited by a cross-section of flow field indicating the velocity magnitude.

Fig. 9 .
Fig. 9. 2D Taylor-Green vortex test.We compare the reflection-advection (MC+R) [Zehnder et al. 2018], NOCM-MRT LBM [Li et al. 2020] and HOME-LBM solvers for (a) different time step sizes and (b) spatial grid sizes on the 2D Taylor-Green vortex simulation.The velocity root-mean-square error is computed based on the known analytical solution.
, and the GPU-based cumulant collision model from [iRMB 2023].We run the low-order reconstruction, high-order collision high-order reconstruction, low-order collision

Fig. 11 .
Fig. 11.Ablation study.To test the individual effects of the high-order distribution reconstruction and high-order collision model, we run the simulation scenario from Fig. 8 with (left) low-order reconstruction and highorder collision, vs. (right) high-order reconstruction and low-order collision.Both cases blow up (yellow regions indicate NaN at time of blow-up), with low-order distribution reconstruction stopping very early on.

Fig. 12 .
Fig. 12. Efficiency of LBM methods: By running the same "flow past sphere" example with three implementations (CU-LBM [iRMB 2023], NOCM-MRT[Lyu et al. 2021], and our HOME-LBM), we compare the different average times per iteration of these solvers.HOME-LBM is on average 5 times more efficient than NOCM-MRT and 10 times more efficient than CU-LBM.Moreover, our smaller memory footprint allows us to reach fine grid sizes for which the two other methods run out of memory.In particular, our moment-based collision is a key factor in our timing gains.

Fig. 13 .
Fig. 13.Leaves blown by a jet: Several leaves of different weights are dropped on a jet flow, demonstrating two-way coupling effects.

Fig. 14 .
Fig. 14.Real time demo: a real-time session where a user moves around an object (here, a bowl) which interacts with a jet flow is entirely done on a single GPU card, volumetric rendering included.

Finally
, we simulated the airflow over a delta wing with a 75 • -sweep angle for an angle of attack of 20 • .As Fig. 3 demonstrates, our simulation reproduces the stable spiral vortices above the wing which are responsible for aerodynamic lift, matching the experimental visualization from [Délery 2001].

Fig. 15 .
Fig. 15.Air comb: Inspired by the hair brush example of[Lyu et al. 2021, Fig. 4], we simulate a hair comb containing sixty teeth moving around and creating fine vortices in its wake, properly capturing the complex fluid-solid interaction engendered by this finely-detailed geometry.

Fig. 17 .
Fig. 17.Falling cars.In this two-way coupling example of a very light car (top) and of a heavier car (bottom) dropped on the floor from a height, the car's terminal speed ends up being very different for each case, and its wake disturbs a static smoke ring.

Table 1 .
Statistics.All examples timed on a NVIDIA GeForce RTX GPU.