Parallel SPH Simulation and Rendering

Yu Mao

Che-Yuan Liang


We implemented parallel SPH fluid simulation and rendering. The project consists of two parts. The first is parallel SPH fluid simulation: we parallelized the SPH simulation using CUDA, exploiting a parallel data structure and a lock-free algorithm to scale up the computation, and taking advantage of the spatial locality of the data structure when launching CUDA kernels to exploit the low latency of CUDA shared memory. The second part is fluid rendering, the process of rendering realistic fluid surfaces from the collection of particles produced by SPH. To keep up with the SPH simulation, we need an efficient approach to fluid rendering.

Fluid Simulation


Bringing real-world phenomena like fluids to interactive applications (such as games) creates an immersive experience. Among the many fluid simulation approaches, Smoothed Particle Hydrodynamics (SPH) is simple and elegant, and is applied in many video games.

Smoothed Particle Hydrodynamics

Following the paper Particle-based fluid simulation for interactive applications and several other references from Intel, we wrote the SPH fluid simulation from scratch in C++. We tested the code by plotting every frame using MATLAB.

SPH Fluid Simulation from Yu Mao on Vimeo.

SPH pseudo-code & strategy:

Overall, SPH simulation is an inherently parallelizable algorithm. For each particle, we do the same computation during the simulation loop, which is exactly the concept of Single Program Multiple Data (SPMD).
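As a minimal illustration of that SPMD structure, here is a serial C++ sketch of the density pass; the struct layout, kernel form, and constants are illustrative, not our actual simulation parameters:

```cpp
#include <cassert>
#include <vector>

// Each particle runs the same code over its neighbors -- the SPMD
// structure that makes the simulation loop easy to parallelize.
struct Particle { float x, y, z, vx, vy, vz, density, pressure; };

// Unnormalized poly6-style density estimate over a neighbor list
// (h = smoothing radius, mass = per-particle mass).
float densityAt(const Particle& p, const std::vector<Particle>& neighbors,
                float h, float mass) {
    float rho = 0.0f;
    for (const Particle& n : neighbors) {
        float dx = p.x - n.x, dy = p.y - n.y, dz = p.z - n.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 < h * h) {
            float w = h * h - r2;
            rho += mass * w * w * w;  // (h^2 - r^2)^3 falloff
        }
    }
    return rho;
}
```

The pressure and force passes have the same shape: an independent loop over each particle's neighbors, which maps directly onto one thread per particle.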

In order to scale up the program, there are several aspects to think about:

  1. Exploring a data structure that allows us to look up neighboring particle attributes quickly while keeping memory usage low.
  2. Exploring different strategies for parallelizing the computation, from either the grid's perspective or the particle's perspective.

Optimizing the sequential algorithm

Before diving into any form of parallelization, we first optimized the sequential code using a uniform grid. Though it makes it easy to find the neighborhood, it consumes memory linearly with respect to the simulation space.

The idea of the uniform grid is that given any point in space with an xyz position, we can always index the particle into a grid cell through strip-like indexing (i.e., floor(x) + floor(y) * x_dim + floor(z) * x_dim * y_dim) in constant time, and access the neighboring cells in a constant number of operations.
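A minimal sketch of that flattening, assuming unit-sized cells (the function and parameter names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Strip-like flattening of a 3D position into a 1D cell index, assuming
// unit-sized cells; xDim and yDim are the grid dimensions along x and y.
int cellIndex(float x, float y, float z, int xDim, int yDim) {
    int cx = (int)std::floor(x);
    int cy = (int)std::floor(y);
    int cz = (int)std::floor(z);
    return cx + cy * xDim + cz * xDim * yDim;
}
```

The 26 neighboring cells are then reachable by offsetting cx, cy, cz by ±1 before flattening, which is why neighbor access is a constant number of operations.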

We implemented it with std::vector, which makes it simple to push a particle index into the vector directly, but the push-like operation is a sequential way of thinking, which becomes a bottleneck to scaling up the program under high contention. Therefore we introduce a parallel way to construct the data structure in the following paragraphs.
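For reference, the sequential build looks roughly like this (a sketch with illustrative names, not our exact code):

```cpp
#include <cassert>
#include <vector>

// Sequential grid build with std::vector: correct on one thread, but the
// push_back into a shared cell is the serialization point that limits
// scaling under contention.
void buildGrid(const std::vector<int>& cellOfParticle,
               std::vector<std::vector<int>>& grid) {
    for (auto& cell : grid) cell.clear();
    for (int i = 0; i < (int)cellOfParticle.size(); ++i)
        grid[cellOfParticle[i]].push_back(i);
}
```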

Parallelizing using CUDA

To fully utilize the compute capacity of modern GPUs, we ported the existing code to a CUDA program.

1. Naive CUDA Implementation (particle's perspective)

In the naive CUDA version, we translated the main part of the SPH simulation into CUDA code.

  1. Neighborhood finding still uses the OpenMP CPU version with the grid-based approach.
  2. Each CUDA thread is responsible for one particle.
  3. We removed marching cubes from the loop, after weighing its computation cost against the graphical quality of the output.
Particles   Neighbor finding   SPH main loop   Host <-> device memory copy
1k          0.86 ms            0.24 ms         0.65 ms
8k          6.20 ms            0.57 ms         2.7 ms
64k         36.03 ms           3.4 ms          24 ms

There are some drawbacks to the naive CUDA approach.

  1. Sequential data operation: the current neighbor finding is a bottleneck because of the sequential pushes into the uniform grid. As the number of particles grows 64x, neighbor finding takes 36.03/0.86 = 41x more time, while the main loop only takes 3.4/0.24 = 14x more time under the same conditions.
  2. Memory consumption: grid-based neighbor finding consumes memory proportional to the number of particles plus a constant overhead for each unit grid cell to maintain, which is not memory efficient; as the number of particles increases, the memory needed increases linearly.
  3. Memory movement: the data is copied back and forth between host and device each frame, which is a bottleneck roughly linear in the number of particles, now that the main loop has been scaled up.
  4. Spatial locality: the current approach doesn't exploit the GPU hardware; all memory accesses are loads/stores to global memory without any use of shared memory.

Optimized CUDA Implementation

We referenced the paper Interactive SPH Simulation and Rendering on the GPU, which demonstrates how to exploit the hardware in SPH fluid simulation and introduces a data structure with lower memory consumption.

The Z-order curve method can reduce the memory overhead to zero, as opposed to the spatial grid method, by traversing the particles along the Z-index. However, not all particles can find their neighbors along a perfect Z-order path; in many situations we have to traverse non-neighbor particles to reach the desired cell. Also, in the context of GPU computing there is no concept of spatial locality in the same sense. Therefore, we chose grid-based indexing plus radix sort to get both low memory consumption and fast neighborhood finding.

In sum, our goal is to construct a data structure that can index the neighbors in constant time, where each grid cell stores only two attributes to locate the particle indices within it.

Data structure implementation

Here is our approach to constructing the grid-based ordering data structure in parallel.

  1. Given two lists, particle indices and their corresponding grid indices (i.e., flattened memory indices), we radix-sort by grid index and copy the new positions and velocities of the particles into the new array in parallel.

  2. Now the particle indices are aligned with the uniform-grid index layout, and some of the particles may lie within the same grid index. Our next step is to count how many particles belong to each grid cell and what the starting index is.

    In order to do this, we have to deal with a synchronization problem. Instead of using a lock, which is not a feasible solution in GPU programming (threads in a warp execute under an implicit barrier, and that dependency can cause the program to deadlock even with a single lock), we use a set of atomic instructions to update the attributes, as suggested in the paper.

    Our implementation is shown below:

    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per particle
    // guard against out-of-range particle or bucket indices
    if ((idx < N) && (d_bucket_index[idx] >= 0)
         && (d_bucket_index[idx] < arraySize)) {
        atomicAdd(&d_bucket[d_bucket_index[idx]].size, 1);             // count particles in this bucket
        atomicCAS(&d_bucket[d_bucket_index[idx]].startIndex, 0, idx);  // first arrival initializes the slot
        atomicMin(&d_bucket[d_bucket_index[idx]].startIndex, idx);     // keep the minimum particle index
    }

    We first use cudaMemset to clear the bucket values to zero before constructing the buckets. We launch a number of threads equal to the number of particles in our simulation, then use an atomicAdd to track how many particles lie within each bucket. We use atomicCAS as a flag for the first particle that arrives, and atomicMin to keep track of the current minimum index within the bucket.

  3. Since eventually we launch as many CUDA blocks as there are unit grid cells to compute the SPH simulation, it is an overhead to fire up kernels for empty buckets. So we have to find the total number of non-empty buckets and a compact representation of the buckets. This can be done with a set of bit-mask operations and an exclusive-sum in parallel, as practiced in assignment two. With the total number of non-empty buckets and the compact list, we can launch fewer CUDA blocks and access the target buckets quickly.
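The compaction in step 3 can be sketched serially as follows (illustrative names; on the GPU the scan itself is the parallel exclusive-sum):

```cpp
#include <cassert>
#include <vector>

// Serial stand-in for the compaction: flag non-empty buckets, take an
// exclusive prefix sum of the flags, then scatter the bucket ids.
std::vector<int> compactNonEmpty(const std::vector<int>& bucketSize) {
    int n = (int)bucketSize.size();
    std::vector<int> scan(n, 0);
    for (int i = 1; i < n; ++i)  // exclusive scan of the non-empty flags
        scan[i] = scan[i - 1] + (bucketSize[i - 1] > 0 ? 1 : 0);
    int total = scan[n - 1] + (bucketSize[n - 1] > 0 ? 1 : 0);
    std::vector<int> compact(total);
    for (int i = 0; i < n; ++i)  // scatter each non-empty bucket id
        if (bucketSize[i] > 0) compact[scan[i]] = i;
    return compact;
}
```

The length of the result gives the number of CUDA blocks to launch, and each entry maps a block back to its bucket.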


  1. We can see the slope of the sequential implementation is larger than that of the parallel implementation when compared with the OpenMP version, which shows it is quite simple to scale up the SPH simulation.

  2. We can see the performance doesn't differ much between the particle parallelization and the grid parallelization.

    There are two factors in the CUDA implementation that affect the result: 1. sequential vs. parallel data structure; 2. particle vs. grid parallelism. The yellow line is the grid/parallel combination and the gray line is the particle/sequential combination.

    Our explanation is that particle and grid parallelism each have their own advantage depending on the density of particles within the kernel radius. For example, if the particles are dense within the SPH kernel, we may want to exploit grid-based parallelism, since we can then load the particle attributes (position/velocity/mass) into shared memory for the computation. On the contrary, if the particles are sparse within the kernel radius (i.e., fewer than 32 neighbors), we may want to parallelize across particles; otherwise we get low kernel thread utilization (because a GPU warp is at least 32 threads in our case). We observed the data and found our particles are quite sparse because of the simulation constants (i.e., rest density), so the grid-based optimization shown in the paper doesn't apply to our data distribution. For the sequential vs. parallel data structure, the parallel data structure always performs better for large numbers of particles.

    In sum, we suspect the indistinguishable gray/yellow lines are the result of combining worst cases: though our particle-based approach is favored by the data distribution, we implemented the sequential data structure in that case; and though the grid-based approach uses the better data structure, it doesn't benefit our sparse particle distribution.

  3. We also see that the CUDA implementation doesn't outperform the parallel CPU version. Apparently we didn't fully utilize the GPU throughput of the GTX 780 for the SPH simulation, for the reasons given above. The particle-parallel implementation is bounded by the sequential neighborhood finding, as shown in the previous table.

In conclusion, we tried to replicate the implementation suggested by the paper; however, we think our simulation density and kernel parameters are different, which means the parallelization strategy needs to be different.

Fluid Rendering from Particles

After performing the SPH simulation, we get a set of particle positions each frame. But to make them look like fluid we need to do some post-processing for rendering. We put a lot of effort into this part.

First Attempt - Marching Cubes

At first, we implemented the classic Marching Cubes algorithm to discretize and extract the fluid surface as a triangle mesh. We chose to implement marching cubes first because it is easy to program and parallelize. The result follows:


We found that the Marching Cubes algorithm has some inherent drawbacks.

Since parallelizing an algorithm that is inherently inefficient and problematic is not worthwhile, we switched to another approach for fluid rendering, called Screen Space Fluid Rendering (SSFR).

Second Attempt - Simplified Screen Space Fluid Rendering

The primary goal behind Screen Space Fluid Rendering (SSFR) is to reconstruct the fluid surface in the viewer's (camera's) space. Concerning ourselves with only the viewer's perspective offers a potential speedup over methods like marching cubes, which attempt to reconstruct the fluid's entire surface. SSFR is not without limitations, but for generating real-time fluid animations it is among the fastest and highest-quality approaches currently available. The state-of-the-art approach is given in Nvidia's paper Screen Space Fluid Rendering with Curvature Flow.

Due to time limits, we implemented a simplified version of SSFR: instead of smoothing the depth texture and computing the normal texture from it, we directly smooth the normal texture and then compute lighting.

This approach basically consists of 3 draw calls. Assuming the SPH simulation has already been carried out (Figure 0): in the first draw call, the points are splatted into spheres (Figure 1) and the normals are computed (Figure 2) and stored. In the second draw call we use OpenGL's compute shader, an extension of OpenGL for general-purpose GPU computing tasks, to blur the normal texture (Figure 3). In the third draw call we apply normal mapping and compute lighting to make the fluid look realistic (Figure 4).

SSFR is a rendering technique, which means it is designed and built on the graphics pipeline, so it is inherently parallelized.

Fast blur via compute shader

Although SSFR is a rendering technique implemented on top of graphics APIs, there are still improvements we can make to the efficiency of the compute shader.

In this context, the compute shader is used to blur the normal texture, which essentially applies a Gaussian filter to the original normal texture.

Separable Filter

As we all know, the Gaussian filter is separable, so there is no need to convolve the original texture with a precomputed 2D matrix. Instead we can split the filter into 2 passes: horizontal and vertical. This reduces the complexity per pixel from O(n^2) to O(2n).
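The separability argument can be sketched in plain C++ as a stand-in for the shader (clamp-to-edge borders, illustrative names):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Two-pass separable blur: the same 1D kernel is applied along rows, then
// along columns, matching the full 2D convolution at O(2n) taps per pixel
// instead of O(n^2). Borders are clamped to the edge.
std::vector<float> blur1D(const std::vector<float>& img, int w, int h,
                          const std::vector<float>& k, bool horizontal) {
    int r = (int)k.size() / 2;
    std::vector<float> out(img.size(), 0.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float s = 0.0f;
            for (int t = -r; t <= r; ++t) {
                int xx = horizontal ? std::min(std::max(x + t, 0), w - 1) : x;
                int yy = horizontal ? y : std::min(std::max(y + t, 0), h - 1);
                s += k[t + r] * img[yy * w + xx];
            }
            out[y * w + x] = s;
        }
    return out;
}
```

A full blur is then `blur1D(blur1D(img, w, h, k, true), w, h, k, false)`: two 1D passes instead of one 2D convolution.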

Implementation using GPU fixed function sampling

But there are even more interesting things to discover. The following optimization method we implemented was proposed by Intel in their article on fast GPU-based image blur algorithms.

Linear filtering works by taking the weighted average of the two pixels surrounding the query point. On today's commodity hardware, the extra cost of linear filtering is almost zero, which means we can read two sample values at approximately the cost of one texture read.

To do this, we need to calculate a new set of weights for linear filtering and a set of scaling factors. When we sample between two neighbouring pixels by linear filtering, we get an interpolated value where the interpolation weights are in proportion to the pixels' original Gaussian filter weights. We then scale the interpolated value by the precomputed scale factor to recover the sum of the weighted pixel values.
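The weight/offset derivation can be sketched as follows (illustrative C++; `mergeTaps` and `LinearTap` are our own names, not from the Intel article):

```cpp
#include <cassert>

// Merge two adjacent discrete taps (weights w1, w2 at texel offsets o1, o2)
// into a single linear-filtered fetch: the combined weight is w1 + w2 and
// the fetch position sits between the texels in proportion to the weights.
struct LinearTap { float weight, offset; };

LinearTap mergeTaps(float w1, float o1, float w2, float o2) {
    float w = w1 + w2;
    return { w, (o1 * w1 + o2 * w2) / w };
}
```

Sampling at the merged offset with linear filtering and multiplying by the merged weight reproduces `w1 * p(o1) + w2 * p(o2)` with one fetch instead of two.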

Some demo code is given below:

vec3 GaussianBlur( sampler2D tex0, vec2 centreUV, vec2 halfPixelOffset, vec2 pixelOffset )
{
    vec3 colOut = vec3( 0, 0, 0 );

    // Kernel width 7 x 7: 4 linear-filtered fetches cover 7 discrete taps
    const int stepCount = 2;
    // Precomputed from the Gaussian tap weights (values here are illustrative)
    const float gWeights[stepCount] = { 0.44908, 0.05092 };
    const float gOffsets[stepCount] = { 0.53805, 2.06278 };

    for( int i = 0; i < stepCount; i++ )
    {
        vec2 texCoordOffset = gOffsets[i] * pixelOffset;
        // one linear-filtered fetch on each side reads two texels at once
        vec3 col = texture( tex0, centreUV + texCoordOffset ).xyz +
                   texture( tex0, centreUV - texCoordOffset ).xyz;
        colOut += gWeights[i] * col;
    }

    return colOut;
}
The code above basically uses 4 texture samples via linear filtering to get a 7-sample separable filter; the sample pairs and interpolation positions are shown below:

Unless the program is bandwidth-bound, we get approximately a 2x speedup for texture blurring.


CUDA accelerated SPH simulation from Yu Mao on Vimeo.



Test Machine


[1] Particle-based fluid simulation for interactive applications

[2] Fluid Simulation for Video Games

[3] Marching cubes: A high resolution 3D surface construction algorithm

[4] OpenMP

[5] An investigation of fast real-time GPU-based image blur algorithms

[6] Screen Space Fluid Rendering with Curvature Flow

[7] Interactive SPH Simulation and Rendering on the GPU