Cavitating and bubbly flows are encountered in many engineering problems involving propellers, pumps, valves, ultrasonic biomedical applications, etc. In this contribution, an openmp parallelized Euler–Lagrange model of two-phase flow problems and cavitation is presented. The two-phase medium is treated as a continuum and solved on an Eulerian grid, while the discrete bubbles are tracked in a Lagrangian fashion with their dynamics computed. The intimate coupling between the two description levels is realized through the local void fraction, which is computed from the instantaneous bubble volumes and locations, and provides the continuum properties. Since, in practice, any such flows will involve large numbers of bubbles, schemes for significant speedup are needed to reduce computation times. We present here a shared-memory parallelization scheme combining domain decomposition for the continuum domain and number decomposition for the bubbles; both selected to realize maximum speedup and good load balance. The Eulerian computational domain is subdivided based on geometry into several subdomains, while for the Lagrangian computations, the bubbles are subdivided based on their indices into several subsets. The number of fluid subdomains and bubble subsets matches with the number of central processing unit (CPU) cores available in a shared-memory system. Computation of the continuum solution and the bubble dynamics proceeds sequentially. During each computation time step, all selected openmp threads are first used to evolve the fluid solution, with each handling one subdomain. Upon completion, the openmp threads selected for the Lagrangian solution are then used to execute the bubble computations. All data exchanges are executed through the shared memory. Extra steps are taken to localize the memory access pattern to minimize nonlocal data fetch latency, since severe performance penalty may occur on a nonuniform memory architecture (NUMA) multiprocessing system where thread access to nonlocal memory is much slower than to local memory. This parallelization scheme is illustrated on a typical nonuniform bubbly flow problem, cloud bubble dynamics near a rigid wall driven by an imposed pressure function (Ma et al., 2013, “Euler–Lagrange Simulations of Bubble Cloud Dynamics Near a Wall,” International Mechanical Engineering Congress and Exposition, San Diego, CA, Nov. 15–21, Paper No. IMECE2013-65191 and Ma et al., 2015, “Euler–Lagrange Simulations of Bubble Cloud Dynamics Near a Wall,” ASME J. Fluids Eng., **137**(4), p. 041301).

## Introduction

Cavitating bubbly flows occur in various engineering problems, such as in ultrasonic biomedical applications, cavitating jets for materials testing, cleaning, and biomass treatments, and unsteady sheet cavities on propeller blades [1–8]. Numerical modeling of such problems is very challenging, since it involves bubble–bubble, bubble–flow, and bubble–wall interactions [7]. In addition, the problem addresses multiscale physics ranging from micron-scale individual bubbles to meter-scale flow field (e.g., propeller blade size). In between, there are very rich and complex mesoscale phenomena such as at the scale of bubble clouds, where the bubbles act collectively [1,2,9–13].

Fully resolved methods, such as direct numerical simulation or the boundary element method, can provide, within their limiting assumptions, details at several scales of interests with impressive progress reported (e.g., Refs. [13–19]), but are limited to detail problems and not appropriate for industrial applications due to high computational cost. In practical engineering applications, cavitating bubbly flows are usually modeled using one of the several approaches: equivalent homogeneous continuum models [5,9,20], Eulerian two-fluid models [21–28], or Eulerian–Lagrangian approaches wherein the bubbles are treated as discrete elements. The former two are based on volume or ensemble-averaged approximations thus suitable for cases where the bubbles are much smaller than the characteristic lengths associated with the overall dynamics of the bubbly mixture. They become inadequate when the scales associated with the individual bubbles are comparable to the mixture flow scales. Eulerian–Lagrangian methods have been developed to treat these cases (e.g., Refs. [29–35]), where the carrier fluid is modeled using continuum fluid methods (Eulerian approach) and bubbles are tracked individually (Lagrangian approach). These methods have shown intrinsic advantages owing to their capability of including the physics across scales down to each individual bubble [1,2]. However, the number of bubbles in actual industrial systems is so large that the computation cost prevents full-scale applications. Therefore, efficient parallelization schemes for significant speedup are needed to reduce computation times.

In this paper, we present an openmp shared-memory parallelization for Eulerian–Lagrangian modeling of cavitating bubbly flows taking advantage of multicore multiprocessing computer systems. These systems usually consist of several multicore CPUs with a large shared-memory pool. In principal, an efficient parallelization should distribute the computation load of both Eulerian and Lagrangian phases as evenly as possible through all the CPU cores (i.e., openmp threads) to reach an optimum load balance. However, bubbles do not remain uniformly distributed in the fluid domain and are not suitable for domain decomposition [36,37]. Therefore, this study employs a hybrid parallelization scheme combining domain decomposition for the Eulerian fluid domain and number decomposition for the Lagrangian bubbles. A similar hybrid decomposition method has also been reported in parallelization of discrete element modeling of particulate flows with ideal load balancing [38–40]. However, a drawback of mixing different decomposition methods is the high overhead involved in nonlocal data fetching when exchanging information between fluid and particle phases during the two-way coupling. The overhead may become considerable and may cancel out parallelization speedup on a NUMA multiprocessing system. Therefore, another major effort in this paper is to address the locality of memory access and reduce nonlocal data fetch latency, thus minimizing any performance penalties.

The paper is organized as follows: The main features of the fully coupled Euler–Lagrange model are summarized in Eulerian-Lagrangian Model section. This is followed by a description of the openmp parallelization scheme. Scaling and efficiency analysis of the scheme are then tested on the problem of the dynamics of microbubble clouds near a rigid wall [1,2]. This case is selected as it is numerically challenging and presents a typical nonuniform bubbly flow configuration where bubbles are confined to a small spherical region in the much larger fluid domain. A comparison of the speedup between nonlocal and localized data access patterns is then provided. Finally, a summary of the findings is presented.

## Eulerian–Lagrangian Model

The three-dimensional (3D) Eulerian–Lagrangian model employed in this study has been extensively validated and documented in our previous studies. These included the investigation of the effects of a propeller flow on bubble size distribution in water [40–42], the modeling of propeller tip vortex cavitation inception [42,43], simulation of the bubbly flow in a bubble augmented propulsor [44], bubble entrainment in plunging jets [45], and wave propagation in bubbly media [1,2,33–35], etc. The present paper addresses acceleration of the computations through parallelization and does not consider additional model development. Therefore, we only provide below a brief summary of the key features of the model and concentrate on the parallelization aspects. Readers interested in more details on modeling aspects may refer to the references provided above.

### Continuum Model for Viscous Mixture Flow.

*u*its velocity,

*p*its pressure, and $\upsilon m$ its kinematic viscosity. The mixture density and viscosity can be related to the liquid (subscript “l”) and gas (subscript “g”) properties by

where $\alpha $ is the gas volume fraction.

Equations (1) and (2) are solved based on the artificial-compressibility method [50], along with a Newton iterative pseudotime stepping procedure (within each physical time step) to obtain a time-accurate solution. The numerical scheme uses a finite-volume formulation, with a first-order Euler implicit difference formula applied to the time derivatives and a flux-difference splitting scheme based on Roe's method [51] and a van Leer's monotonic upstream-centered scheme for conservation laws method [52] for spatial differencing of convection terms. A second-order central differencing is used for the viscous terms.

### Lagrangian Modeling of Dispersed Phase.

*R*is the bubble radius and dots represent time differentiation. $\mu m$ is the two-phase medium dynamic viscosity,

*p*is the liquid vapor pressure, and $pg$ is the gas pressure in the bubble.

_{v}*k*is the gas polytropic constant and is equal to 1.4 for air under adiabatic conditions and one under isothermal conditions, and

*γ*is the water surface tension. $cm$ is the local sound speed in the mixture, which is related to the local void fraction,

*α*, by

where $cl$ is the sound speed in pure liquid [5]. *p*_{enc} and **u**_{enc} are the mixture pressures and velocities averaged over the bubble surface, and **u**_{b} is the bubble center translation velocity.

The introduction in Eq. (4) of $(uenc-ub)2/4$ is to account for slip between the bubble and the host medium. The average terms, *p*_{enc} and **u**_{enc}, are to account for nonuniform pressures and velocities along the bubble surface. They are obtained through an arithmetic average of the mixture quantities, *p* and *u*, in Eqs. (1) and (2), at six polar points on a bubble surface. The use of *p*_{enc} results in a major improvement over classical models, which use the pressure at the bubble center.

This approach has been validated in our previous studies [7,31,42]. For instance, in the simulation of the dynamics of a bubble captured by a vortex, the conventional spherical model overpredicts significantly the bubble size versus time when using the pressure at the bubble center. On the other hand, using Eq. (4) enables a more realistic evaluation of the bubble dynamics with the solution agreeing very well with a fully resolved moving grid method [42].

### Eulerian–Lagrangian Coupling.

The two-way coupling between the Eulerian continuum-based model and the Lagrangian discrete bubble model is realized as follows [1,2,34,35]:

The bubble dynamics and motion of the individual bubbles in the flow field are controlled by the two-phase medium local properties and gradients.

The local properties of the mixture are determined by the bubble size and distribution, and the void fractions and local densities are determined by the instantaneous bubble volume and distributions.

The mixture flow field has an evolving mixture density in space and time and satisfies mass and momentum conservation.

The key of this coupling scheme is to deduce the void fraction from the instantaneous bubble sizes and locations. In our previous studies, the void fraction was obtained by dividing in each computational cell the total volume of the bubbles in the cell by the cell volume. We have found that scheme not totally satisfactory because of the requirement of many bubbles in each cell and also mainly because of the noncontinuous character of *α* across cells resulting in difficulties with differentiation [33].

where $Vjb$ and $Vkcell$ are the volumes of bubble, *j*, and a cell, *k*, respectively; $Ni$ is the number of bubbles which are in the “influence range” to a cell *i*; *N*_{cells} is the number of cells “influenced” by a bubble *j*; and *f _{i}_{,}_{j}* is the weight factor of bubble

*j*contributing to a cell

*i*, which gradually decays to zero as the distance between the bubble and cell increases. Here, the Gaussian distribution function is used as illustrated in Fig. 1 [1,2,34,35].

## Parallelization Algorithm and Implementation

### Algorithm and Program Procedure.

Some multicore, multiprocessing systems have a shared-memory pool consisting of multiple multicore CPUs. For example, dyna-omp, one of dynaflow's multiprocessing workstations, has four 12-core AMD^{TM} Opteron 2.80 GHz CPUs with 128 GB shared memory. A shared-memory openmp parallelization scheme was implemented on dyna-omp to parallelize our 3dynafs Eulerian/Lagrangian coupled model.

As illustrated in Fig. 2, a scheme combining domain decomposition for the continuum domain and number decomposition for the bubbles is used. Both decompositions are selected to realize maximum speedup and good load balance.

The Eulerian fluid domain is geometrically subdivided into subdomains with the continuum computations in each subdomain handled by an openmp thread running on one CPU processor core.

The bubbles are also subdivided, based on their indices, into several subsets with each handled by an openmp thread using a single CPU processor core.

The overall flow chart for implementation of the hybrid parallelization algorithm is shown in Fig. 3. The algorithm shows that the computations of the continuum solution and the bubble dynamics proceed sequentially. During each computation time step, all selected openmp threads (maximum thread number limited by the number of CPU cores available) are first used to execute the continuum computations, with each handling one subdomain. Upon completion, the openmp threads selected for the Lagrangian computations are then used to obtain the bubble dynamics. This procedure is repeated for each physical time step until the final step is reached.

### openmp Implementation.

The advantage of domain decomposition for Eulerian fluid solver is that all cells in a given subdomain (assigned to one thread) are stored in contiguous memory addresses. This is very favorable for memory access since continuum cells need to interact only with their neighboring cells. Only the small portion of continuum cells near the borders of subdomains require information from cells located in different subdomains and have to fetch the data from the global shared-memory. The openmp parallelization is realized based on a multiblock structure of the code. This is done by enclosing the block (i.e., subdomain) loops with openmp directives: “!$omp parallel do…!$omp end parallel.”

Similarly, the Lagrangian bubble computations are done by enclosing the loops of bubble dynamics and bubble motion computations with similar directives. The bubbles are divided into a multitude of subsets based on their indexes with each handled independently by a thread processor. Each thread can efficiently access a contiguous global memory space. Therefore, the openmp implementation for either the Eulerian or the Lagrangian component alone is straightforward.

However, an efficient parallel implementation for the coupling between the phases, in particular the void fraction computations (Eq. (8)) requires careful consideration for data access pattern. Difficulty stems from the two different decomposition methods applied to the continuum and to the bubbles. Nonlocal data fetching is necessary since bubbles located in a continuum computation cell may belong to several different processor cores. This necessitates irregular, remote data fetching, which results in performance penalty. On a NUMA multiprocessing system especially, thread access to nonlocal memory is much slower than to local memory. This high-latency data access mode creates very high overhead, which could cancel out any gains from load balancing.

To overcome this, extra steps are taken to localize the memory access pattern to minimize data fetch latency. This is illustrated with the pseudocode of void fraction computation in Fig. 4. The code stores the void fractions in a global array, void. Following Eq. (8), for a given cell indexed by (*i,j,k,m)* (where *i, j, and k* are the indices along *x, y, and z* directions, and *m* is the block index), multiple bubbles located inside and around the cell contribute void effects to it. To avoid remote data access (e.g., have each bubble access the global memory address corresponding to *void* (*i,j,k,m*) and modify it by adding the void contribution), a temporal array, “void_tmp” is created under each thread and temporally stores the void contribution of all the bubbles assigned to that thread (Fig. 4(b)). Once the bubble loop is completed by all threads, the void values stored in the void_tmp arrays are summed and transferred to the global array void. The temporal arrays are then emptied to release memory. At the beginning of each loop, these temporary arrays must be allocated independently under each thread in order to ensure the locality of data placement by following the “first touch policy” [36]. According to this policy, array data are stored in a memory local to the thread which allocates and initializes the array.

## Test Simulation and Results

To evaluate the developed openmp parallelized Euler–Lagrange model, we apply it to simulate the dynamics of a cloud of microbubbles located near a rigid wall and subjected to a sinusoidal pressure time variation imposed on the computational domain far field boundaries [1,2]. This is a numerically challenging problem since it involves violent bubble volume changes and strong coupling effects fed back to the flow [2,5,9,13]. Also, the problem involves a typical nonuniform void fraction distribution since the bubbles are only present in a small space region while the resolved fluid domain extends in the 3D space much far away than the initial cloud volume. Figure 5 illustrates the dynamics for a bubble cloud of initial radius 1.5 mm, consisting of initially monodispersed bubbles of radius 50 *μ*m and an average void fraction of 5%. Nonslip boundary conditions are imposed at the rigid wall and zero velocity gradients and imposed pressure are imposed on the other “far-field” domain boundaries [1]. This cloud was excited by an imposed sinusoidal pressure variations at infinity with an amplitude of 1.5 atm and a frequency of 10 kHz. The frequency was selected to match the cloud natural frequency [5]. As known from previous studies [1,2], under this frequency, a collective cluster behavior occurs as the cloud experiences a cascading collapse with the bubbles farthest away from the wall collapsing first and the nearest ones collapsing last cumulating the energy. This results in a very high-pressure peak at the wall at the end of cloud collapse, as illustrated in the right panel of Fig. 5. More simulation details and result analysis can be found in Refs. [1,2] and are not repeated here for brevity. We focus instead on the major objective of this paper, i.e., parallelization efficiency tests and analysis.

### Work Load Decomposition and Balancing.

Figure 6 shows how the workload for a case with 65,000 Eulerian grids and 170,000 Lagrangian bubbles is distributed, with different colors (or gray scales in print) corresponding to different openmp threads. Since the mesh is finer in the cloud region than away from it, the corresponding subdomains are smaller than elsewhere to ensure that each subdomain contains the same number of continuum cells. On the other hand, bubbles are evenly subdivided into subsets of equal number, independent of their locations. Therefore, a good load balance is achieved through this decomposition.

### Scaling Tests and Speedup Analysis.

To examine the scalability of the developed scheme, we tested the code for the same bubble cloud shown above on an Intel Xeon CPU node of a Linux cluster. This node has dual six-core 3 GHz Xeon CPUs and a total of 12 GB shared memory. Figure 7 (top) shows the wall time per time step of computation and the speedup factor (bottom) when varying the number of openmp threads. Here, a time step includes both the Eulerian and Lagrangian solution and the coupling through void fraction computation. It is seen that the wall time per time step *i* drops from 923 s to about 81.6 s when the thread number increases from one to 12. The speedup factor is linear and nearly ideal up to 12 threads as shown in Fig. 7 (bottom).

### Comparison Between Nonlocal and Localized Memory Accesses.

To quantify the gains from the procedure of data access optimization for void fraction computation, the speedup factors for each call of void fraction subroutine are compared in Fig. 8 between nonlocal data access and optimized local data access. The speedup is seen to gradually increase as more threads are used. With 12 threads, a 17% improvement can be seen.

For the purpose of further demonstrating the portability of this optimization procedure, similar tests were done on a more challenging NUMA multiprocessing system, dyna-omp. Each of the four CPUs in dyna-omp consists of two NUMA units with each having its own local memory. Thread access to memory belonging to a different NUMA unit has a high-latency and is very inefficient. The corresponding overhead may result in nullifying the acceleration achieved by openmp parallelization. This can be clearly observed in the top plot of Fig. 9, which shows that if the localization of data access is not used, increasing the number of threads over six may *increase* the cost of the void fraction computation instead of reducing it. The number six is critical because beyond six threads more than two NUMA units are involved and remote fetch of data becomes necessary. On the contrary, after applying the localized data access optimization significant speedup can be achieved with higher thread numbers. Figure 9 shows that by increasing the number of threads from one to 48, the speedup factor grows to 44, a parallelization efficiency of about 92%. The above examples also indicate good portability of the developed parallelization scheme since good speedups were achieved with both Intel and AMD multicore multiprocessing systems, which use very different architectures including core number, memory size, and memory types.

## Conclusion

An openmp shared-memory parallelized 3D Euler–Lagrange model was developed for two-phase bubbly flow problems and cavitating flows. This parallelization scheme combines domain decomposition for the continuum domain and number decomposition for the bubbles to guarantee a good load balance, even for nonuniform bubble distributions. The executions of the Eulerian and Lagrangian solvers proceed sequentially. All available openmp threads are first used to execute the continuum computations with each handling one subdomain. The same are then used to execute the bubble computations with each thread handling one bubble subset. All data exchanges are executed through accessing the shared memory. However, special consideration is given in the programming to localize as much as possible memory access through use of temporary local memory in order to minimize remote data fetch latency.

Scaling tests were conducted using the problem of a bubble cloud dynamics near a rigid wall. A nearly ideal speedup could be achieved on both a 12-core Intel Xeon Node and on a NUMA 48-core AMD Opteron workstation.

## Acknowledgment

This work was supported by the Department of Energy under SBIR Phase II DE-FG02-07ER84839. The authors would also like to acknowledge support from the Office of Naval Research under Contract No. N00014-09-C-0676 monitored by Dr. Ki-Han Kim. This work was also supported in part by grants of computer time from The National Energy Research Scientific Computing Center (NERSC), a scientific computing facility for the Office of Science in the U.S. Department of Energy.

## Nomenclature

- $CD$ =
drag coefficient

- $CL$ =
lift coefficient

- $Cm$ =
mixture sound speed on bubble surface

*d*=_{ij}distance between points

*i*and*j**k*=polytropic gas constant

*p*=pressure

- $pg$ =
gas pressure

- $pv$ =
vapor pressure

- $pg0$ =
initial gas pressure

*R*=bubble radius

- $Reb$ =
Reynolds number

- SD =
subdomain

*t*=time

**u**=velocity vector

*α*=void fraction

- $\alpha 0$ =
initial void fraction

- $\gamma $ =
surface tension of liquid

- $\mu $ =
absolute viscosity

- $\mu m$ =
absolute viscosity of the mixture

- $\rho $ =
density

- $\rho m$ =
density of the mixture

**ω**=vorticity