Parallel SPH Simulation and Rendering
Yu Mao(ymao1@andrew.cmu.edu)
CheYuan Liang(cheyuanl@andrew.cmu.edu)
Summary
We implemented parallel SPH fluid simulation & rendering. The project consists of two parts. The first is parallel SPH fluid simulation: we parallelized the SPH simulation using CUDA, exploiting a parallel data structure and a lock-free algorithm to scale up the computation, and taking advantage of the spatial locality in that data structure to launch CUDA kernels that exploit the low latency of CUDA shared memory. The second part is fluid rendering: the process of rendering realistic fluid surfaces from the collection of particles produced by SPH. To keep up with the SPH simulation, we need an efficient approach for fluid rendering.
Fluid Simulation
Background
Adding real-world phenomena like fluids to interactive applications (such as games) creates an immersive experience. Among the many fluid simulation approaches, Smoothed Particle Hydrodynamics (SPH) is simple and elegant, and is applied in many video games.
Smoothed Particle Hydrodynamics
Following the paper Particle-based fluid simulation for interactive applications and several other references from Intel, we wrote the SPH fluid simulation from scratch in C++. We tested the code by plotting every frame using MATLAB.
SPH Fluid Simulation from Yu Mao on Vimeo.
SPH pseudocode & strategy:
Overall, the SPH simulation is an inherently parallelizable algorithm. For each particle, we do the same computation during the simulation loop, which is exactly the concept of Single Program Multiple Data (SPMD).
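The per-particle loop can be sketched as follows. This is our own simplified, single-threaded illustration of the standard SPH structure (density/pressure, then forces, then integration), not the project's actual code; the kernel constants are the usual poly6 formulation, the parameter values are illustrative, and the pressure and viscosity forces are omitted for brevity (only gravity is applied):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Particle {
    float x[3], v[3], f[3];
    float density, pressure;
};

// Poly6 smoothing kernel for density estimation (h = smoothing radius).
float poly6(float r2, float h) {
    const float PI = 3.14159265358979f;
    if (r2 > h * h) return 0.0f;
    float d = h * h - r2;
    return 315.0f / (64.0f * PI * powf(h, 9)) * d * d * d;
}

// One simulation step: every particle runs the same program (SPMD).
void sphStep(std::vector<Particle>& ps, float h, float mass,
             float restDensity, float stiffness, float dt) {
    // 1) density and pressure from neighboring particles
    for (auto& pi : ps) {
        pi.density = 0.0f;
        for (auto& pj : ps) {
            float r2 = 0.0f;
            for (int d = 0; d < 3; d++) {
                float diff = pi.x[d] - pj.x[d];
                r2 += diff * diff;
            }
            pi.density += mass * poly6(r2, h);
        }
        pi.pressure = stiffness * (pi.density - restDensity);
    }
    // 2) forces (pressure/viscosity omitted in this sketch; gravity only)
    for (auto& pi : ps) {
        pi.f[0] = 0.0f;
        pi.f[1] = -9.8f * pi.density;  // force density
        pi.f[2] = 0.0f;
    }
    // 3) symplectic Euler integration
    for (auto& pi : ps) {
        for (int d = 0; d < 3; d++) {
            pi.v[d] += dt * pi.f[d] / pi.density;
            pi.x[d] += dt * pi.v[d];
        }
    }
}
```

The inner neighbor loop here is brute-force O(n^2); the uniform grid discussed below replaces it with a constant-time neighborhood lookup.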
In order to scale up the program, there are several aspects to think about:
- Exploring a data structure that allows us to look up neighboring particle attributes quickly while keeping memory usage low.
- Exploring different strategies for parallelizing the computation, from either the grid's perspective or the particle's perspective.
Optimizing the sequential algorithm
Before diving into any form of parallelization, we first optimized the sequential code by using a uniform grid. Though it makes finding the neighborhood easy, it consumes memory linearly with respect to the simulation space.
The idea of the uniform grid is that, given any point in space with an xyz position, we can always index the particle into a grid cell through stripe-like indexing (i.e. floor(x) + floor(y) * x_dim + floor(z) * x_dim * y_dim) in constant time, and access the neighboring data in a constant number of operations.
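The flat indexing above can be sketched as follows (an illustrative sketch, not our exact code; it assumes positions have been shifted into [0, dim) per axis and a cell size of 1, matching the floor()-based formula):

```cpp
#include <cassert>
#include <cmath>

// Map a 3D position to a flat uniform-grid cell index in O(1).
// Neighboring cells are reached by offsetting the integer cell
// coordinates by +/-1 before flattening.
int cellIndex(float x, float y, float z, int xDim, int yDim) {
    int cx = (int)std::floor(x);
    int cy = (int)std::floor(y);
    int cz = (int)std::floor(z);
    return cx + cy * xDim + cz * xDim * yDim;
}
```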
We implemented it with std::vector, which makes it simple to push the particle index into the vector directly, but a push-style operation is a sequential way of thinking and becomes a bottleneck when scaling up the program under high contention. Therefore we introduce a parallel way to construct the data structure in the following paragraphs.
Parallelizing using CUDA
To fully utilize the compute capacity of modern GPUs, we ported the existing code to a CUDA program.
1. Naive CUDA Implementation (particle's perspective)
In the naive CUDA version, we translated the main part of the SPH simulation into CUDA code.
- Neighborhood finding still uses the OpenMP CPU version with the grid-based approach.
- Each CUDA thread is responsible for one particle.
- We removed marching cubes from the loop, after considering the computation cost and the graphical quality of the output.
| Particles | Neighbor finding | SPH main loop | Host <-> device memory copy |
|-----------|------------------|---------------|-----------------------------|
| 1k        | 0.86 ms          | 0.24 ms       | 0.65 ms                     |
| 8k        | 6.20 ms          | 0.57 ms       | 2.7 ms                      |
| 64k       | 36.03 ms         | 3.4 ms        | 24 ms                       |
There are some drawbacks to the naive CUDA approach.
- Sequential data operation: The current neighbor finding is a bottleneck because of the sequential push into the uniform grid. As the number of particles grows 64x, neighbor finding takes 36.03/0.86 = 41x more time, while the main loop takes only 3.4/0.24 = 14x more time under the same conditions.
- Memory consumption: Grid-based neighbor finding consumes memory proportional to the number of particles plus a constant overhead for each grid cell, which is not memory efficient: as the number of particles increases, the memory requirement grows linearly.
- Memory movement: The data is copied back and forth between host and device each frame, which is a bottleneck roughly linear in the number of particles, now that the main loop has been scaled up.
- Spatial locality: The current approach does not exploit the GPU hardware; all memory accesses are loads/stores from global memory, without any use of shared memory.
Optimized CUDA Implementation
We reference the paper Interactive SPH Simulation and Rendering on the GPU, which demonstrates how to exploit the hardware in SPH fluid simulation and introduces a data structure with lower memory consumption.
- First, the main idea of the paper is to use the Z-order space-filling curve to reduce the memory requirement. The advantage of sorting the particles in a specific order (i.e. Z-order) is that each grid cell then only has to store at most two attributes (start index and size), since the particles assigned to the same grid cell have contiguous indices.
- Second, it computes the particles from a CUDA block's perspective. We can load the particles within the kernel (radius) and the neighboring particles into the shared memory of the CUDA block.
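Z-order sorting works by interleaving the bits of the integer cell coordinates into a single Morton code and sorting particles by that key. A minimal sketch of the standard 10-bit-per-axis bit-spreading trick (our own illustration, not code from the paper):

```cpp
#include <cassert>
#include <cstdint>

// Interleave the low 10 bits of x, y, z into a 30-bit Morton code.
// Sorting particles by this key places particles from the same cell
// (and from nearby cells) contiguously in memory.
uint32_t morton3(uint32_t x, uint32_t y, uint32_t z) {
    auto spread = [](uint32_t v) {
        v &= 0x3FF;                       // keep 10 bits
        v = (v | (v << 16)) & 0x030000FF; // spread pairs of bytes
        v = (v | (v << 8))  & 0x0300F00F;
        v = (v | (v << 4))  & 0x030C30C3;
        v = (v | (v << 2))  & 0x09249249; // one bit every 3 positions
        return v;
    };
    return spread(x) | (spread(y) << 1) | (spread(z) << 2);
}
```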
Although the Z-order curve method can reduce the memory overhead to zero compared with the spatial grid method, by traversing the particles along the Z-index, not every particle can find its neighbors along a perfect path of the Z-order curve. In many situations, we have to traverse non-neighbor particles to reach the desired cell. Also, in the context of GPU computing there is no concept of spatial locality in global memory access. Therefore, we chose grid-based indexing + radix sort to get both low memory consumption and fast neighborhood finding.
In sum, our goal is to construct a data structure that can index the neighbors in constant time, where each grid cell stores only two attributes for the particle indices it contains.
Data structure implementation
Here is our approach to constructing the grid-based-ordering data structure in parallel.
Given two lists, the particle indices and their corresponding grid indices (i.e., flattened memory indices), we use radix sort and copy each particle's new position and velocity into the new arrays in parallel.
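The sort-and-reorder step can be sketched on the CPU as follows (our own illustrative sketch using std::sort in place of the GPU radix sort; on the GPU the sort and the gather copies run in parallel):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct Float3 { float x, y, z; };

// Sort particle indices by their flat cell index, then gather the
// position/velocity arrays into the new order, so that particles in
// the same cell end up contiguous in memory.
void sortByCell(const std::vector<int>& cellOf,
                std::vector<Float3>& pos, std::vector<Float3>& vel,
                std::vector<int>& sortedCell) {
    size_t n = cellOf.size();
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return cellOf[a] < cellOf[b]; });

    std::vector<Float3> newPos(n), newVel(n);
    sortedCell.resize(n);
    for (size_t i = 0; i < n; i++) {  // gather into the sorted order
        newPos[i] = pos[order[i]];
        newVel[i] = vel[order[i]];
        sortedCell[i] = cellOf[order[i]];
    }
    pos.swap(newPos);
    vel.swap(newVel);
}
```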

Now the particle indices are aligned with the uniform-grid index layout, and some of the particles may lie within the same grid cell. Our next step is to count how many particles belong to each cell, and what the starting index is.
To do this, we have to deal with a synchronization problem. Using a lock is not a feasible solution in GPU programming: the threads in a warp execute with an implicit barrier, and that dependency can deadlock the program even with a single-lock configuration. Instead, we use a set of atomic instructions to update the attributes, as suggested in the paper.
Our implementation is shown below:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// guard against out-of-range threads and invalid bucket indices
if ((idx < N) && (d_bucket_index[idx] >= 0) && (d_bucket_index[idx] < arraySize))
{
    // count this particle into its bucket
    atomicAdd(&d_bucket[d_bucket_index[idx]].size, 1);
    // the first particle to arrive initializes startIndex
    atomicCAS(&d_bucket[d_bucket_index[idx]].startIndex, 0, idx);
    // keep the minimum particle index seen so far
    atomicMin(&d_bucket[d_bucket_index[idx]].startIndex, idx);
}
We first use cudaMemset to clear the bucket values to zero before we construct the buckets. We launch a number of threads equal to the number of particles in our simulation, then use atomicAdd to track how many particles lie within each bucket. The atomicCAS acts as a flag set by the first particle that arrives, and the atomicMin keeps track of the current minimum particle index within the bucket.
Since we eventually launch as many CUDA blocks as there are grid cells to compute the SPH simulation, it is an overhead to fire up the kernel for empty buckets, so we have to find the total number of non-empty buckets and a compact representation of them. This can be done with a set of bit-mask operations and an exclusive prefix sum in parallel, as practiced in assignment two. With the total number of non-empty buckets and the compacted list, we can launch fewer CUDA blocks and access the target buckets quickly.
Performance
We can see that the slope of the sequential implementation is larger than that of the parallel (OpenMP) implementation, which shows that the SPH simulation is quite simple to scale up.

We can see that the performance does not differ much between the particle-based parallelization and the grid-based parallelization.
There are two factors in the CUDA implementation which affect the result: 1. sequential vs. parallel data structure; 2. particle parallelism vs. grid parallelism. The yellow line is the grid/parallel combination and the gray line is the particle/sequential combination.
Our explanation is that particle and grid parallelism each have their own advantage depending on the density of particles within the kernel radius. For example, if the particles are dense within the SPH kernel, we may want to exploit grid-based parallelism, since we can then load the particle attributes (position/velocity/mass) into shared memory for the computation. On the contrary, if the particles are sparse within the kernel radius and its neighborhood (i.e. fewer than 32), we may want to parallelize across particles; otherwise we end up with low thread utilization (because the minimum scheduling unit of the GPU, a warp, is 32 threads in our case). We observed the data and found that our particles are quite sparse because of the simulation constants (i.e., rest density), so the grid-based optimization shown in the paper does not suit our data distribution. As for the sequential vs. parallel data structure, the parallel data structure always performs better for large numbers of particles.
In sum, we suspect the indistinguishable gray/yellow lines are the result of combining each approach with its worst pairing: our particle-based approach fits the data distribution, but we implemented the sequential data structure in that case; the grid-based approach uses the better data structure, but it does not benefit from our sparse particle distribution.
We also see that the CUDA implementation does not outperform the parallel CPU version. Apparently we did not fully utilize the GPU throughput of the GTX 780 for the SPH simulation, for the reasons given above. The particle-parallel implementation is bounded by the sequential neighborhood finding, as shown in the previous table.
In conclusion, we tried to replicate the implementation suggested by the paper; however, we think our simulation density and kernel parameters are different, which means the parallelization strategy needs to be different as well.
Fluid Rendering from Particles
After performing the SPH simulation, we get a set of particle positions each frame. But to make them look like fluid, we need to do some post-processing for rendering. We put a lot of effort into this part.
First Attempt - Marching Cubes
At first, we implemented the classic Marching Cubes algorithm to discretize and extract the fluid surface as a triangle mesh. We chose to implement marching cubes first because it is easy to program and parallelize. The result follows:
We found the Marching Cubes algorithm has some inherent drawbacks:
- We need to run marching cubes at a very high resolution, otherwise the result looks very blobby; on the other hand, to look realistic, we need more memory to store the marching-cube grid and the generated mesh.
- Marching cubes generates all of the triangles for the fluid surface regardless of their visibility, so in some cases a large portion of the computation is wasted.
Since parallelizing an algorithm that is inherently inefficient and problematic is not worthwhile, we switched to another approach for fluid rendering, called Screen Space Fluid Rendering (SSFR).
Second Attempt - Simplified Screen Space Fluid Rendering
The primary goal of Screen Space Fluid Rendering (SSFR) is to reconstruct the fluid surface in the viewer's (camera's) space. Concerning ourselves with only the viewer's perspective offers a potential speedup over methods like marching cubes, which attempt to reconstruct the fluid's entire surface. SSFR is not without limitations, but for generating real-time fluid animations it is among the fastest and highest-quality methods currently available. The state-of-the-art approach is given in Nvidia's paper Screen Space Fluid Rendering with Curvature Flow.
Due to time limits, we implemented a simplified version of SSFR: instead of smoothing the depth texture and computing the normal texture from the depth texture, we directly smooth the normal texture and then compute lighting.
This approach basically consists of three draw calls. Assuming the SPH simulation has already been carried out (Figure 0): in the first draw call, the points are splatted into spheres (Figure 1) and the normals are computed (Figure 2) and stored. In the second draw call, we use OpenGL's compute shader, an extension of OpenGL for general-purpose GPU computing tasks, to blur the normal texture (Figure 3). In the third draw call, we apply normal mapping and compute lighting to make the fluid look realistic (Figure 4).
SSFR is a rendering technique, which means it is designed and built on top of the graphics pipeline, so it is inherently parallelized.
Fast blur via compute shader
Although SSFR is a rendering technique implemented with graphics APIs, there are still some improvements we can make to the efficiency of the compute shader.
In this context, the compute shader is used to blur the normal texture, which essentially applies a Gaussian filter to the original normal texture.
Separable Filter
As is well known, the Gaussian filter is separable, so there is no need to convolve the original texture with a precomputed 2D matrix. Instead we can split the filter into two passes, horizontal and vertical, which reduces the per-pixel complexity from O(n^2) to O(2n).
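Separability can be checked numerically: blurring once per axis with a 1D kernel gives the same result as convolving with the full 2D kernel. A small CPU sketch of the two-pass blur (our own illustration, assuming clamp-to-edge boundaries and a normalized symmetric kernel):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

// One 1D convolution pass with clamp-to-edge boundaries. 'w' holds the
// half-kernel: w[0] is the center weight, w[k] the weight of the taps
// at distance +/-k. Running horizontal then vertical gives the full
// separable Gaussian blur.
std::vector<float> blur1D(const std::vector<float>& img, int width, int height,
                          const std::vector<float>& w, bool horizontal) {
    std::vector<float> out(img.size());
    int r = (int)w.size() - 1;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            float acc = 0.0f;
            for (int k = -r; k <= r; k++) {
                int sx = x, sy = y;
                if (horizontal) sx += k; else sy += k;
                sx = std::min(std::max(sx, 0), width - 1);   // clamp to edge
                sy = std::min(std::max(sy, 0), height - 1);
                acc += w[std::abs(k)] * img[sy * width + sx];
            }
            out[y * width + x] = acc;
        }
    return out;
}
```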
Implementation using GPU fixed function sampling
But there are even more interesting things to discover. The following optimization method we implemented is proposed by Intel here.
Linear filtering works by taking the weighted average of the two pixels surrounding the query point. On today's commodity hardware, the extra cost of linear filtering is almost zero, which means we can read two sample values at approximately the cost of one texture read by using linear filtering.
To do this, we need to calculate a new set of weights for linear filtering and a set of scaling factors. When we sample two neighboring pixels with linear filtering, we get an interpolated value between them, where the weights used for the interpolation are in proportion to their original Gaussian filter weights. We then scale the interpolated value by the precomputed scale factor to recover the sum of the weighted pixel values.
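Concretely, for two adjacent discrete taps with Gaussian weights w0 (at distance d0) and w1 (at d0 + 1), the merged linear-filtering tap has weight w0 + w1 and sampling offset (d0*w0 + (d0+1)*w1) / (w0 + w1). A small sketch verifying the equivalence (our own illustration; linearSample models what the hardware's bilinear filter returns):

```cpp
#include <cassert>
#include <cmath>

struct LinearTap { float weight, offset; };

// Merge two adjacent discrete taps into one linearly-filtered tap:
// sampling at 'offset' returns the hardware-interpolated value, and
// scaling it by 'weight' recovers the sum w0*p0 + w1*p1.
LinearTap mergeTaps(float w0, float w1, float d0) {
    LinearTap t;
    t.weight = w0 + w1;
    t.offset = (d0 * w0 + (d0 + 1.0f) * w1) / (w0 + w1);
    return t;
}

// Model of the hardware linear filter: interpolate between two texels
// p0 (at distance d0) and p1 (at d0 + 1), sampled at 'offset'.
float linearSample(float p0, float p1, float d0, float offset) {
    float t = offset - d0;  // interpolation fraction in [0, 1]
    return (1.0f - t) * p0 + t * p1;
}
```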
Some demo code is given below:
vec3 GaussianBlur( sampler2D tex0, vec2 centreUV, vec2 halfPixelOffset, vec2 pixelOffset )
{
    vec3 colOut = vec3( 0, 0, 0 );

    // Kernel width 7 x 7
    const int stepCount = 2;
    const float gWeights[stepCount] = {
        0.44908,
        0.05092
    };
    const float gOffsets[stepCount] = {
        0.53805,
        2.06278
    };

    for( int i = 0; i < stepCount; i++ )
    {
        vec2 texCoordOffset = gOffsets[i] * pixelOffset;
        // each linearly-filtered read covers a symmetric pair of taps
        vec3 col = texture( tex0, centreUV + texCoordOffset ).xyz +
                   texture( tex0, centreUV - texCoordOffset ).xyz;
        colOut += gWeights[i] * col;
    }

    return colOut;
}
The code above uses 4 texture samples via linear filtering to implement a 7-sample separable filter; the sample pairs and interpolation positions are shown below:
Unless the program is bandwidth-bound, we get approximately a 2x speedup for texture blurring.
Demo
CUDA accelerated SPH simulation from Yu Mao on Vimeo.
Code
https://github.com/YuMao1993/SmoothedParticleHydrodynamics
RESOURCES
Test Machine
ghc41.ghc.andrew.cmu.edu
CPU: Intel(R) Xeon(R) CPU W3670 @ 3.20GHz
GPU: NVIDIA GeForce GTX 780
References:
[1] Particle-based fluid simulation for interactive applications
[2] Fluid Simulation for Video Games
[3] Marching cubes: A high resolution 3D surface construction algorithm
[4] An investigation of fast real-time GPU-based image blur algorithms