2022-23 Synopsys Research Project
Project Number 112-H10-31
Ryland Goldman • Los Gatos High School
N-body simulations are computer models of systems of bodies, typically used for physics and astronomy applications such as predicting the orbits of planets. These simulations usually require large amounts of processing power and time for the physics computations. To work around this, developers often round values and compromise on accuracy in order to optimize the software. This project instead uses hardware acceleration, rather than mathematical approximation, to improve the performance of an N-body simulation written in Python.
The project compares a NumPy-based approach running on a 16-thread Intel 12600K CPU (JIT-compiled with Numba) against CuPy interfacing with an NVIDIA RTX 3090 GPU via the CUDA framework. The CPU group was the control, and CUDA was the experimental group. Two additional test groups used PyOpenCL to compare the two devices directly under a common framework. One hundred trials were run for each of the four groups, repeated at each power of two from 2^13 to 2^18 bodies.
Using 2^16 bodies, the speedup factor for CuPy was 3.66x, for OpenCL (GPU) 1.05x, and for OpenCL (CPU) 0.56x. This suggests that CUDA is significantly faster than the CPU alone, and that the GPU OpenCL implementation was roughly twice as fast as the CPU OpenCL implementation.
ASR-Video.mp4
- Windows 11 Home Build 22621.900
- Python 3.10.7 with pip CLI
- Python random, time, os, sys, and math built-in libraries
- Nvidia CUDA Toolkit 11.8
- CuPy 11.3.0
- NumPy 1.23.3
- Fastrlock 0.8.1
- Matplotlib 3.6.2
- Contourpy 1.0.6
- Cycler 0.11.0
- Fonttools 4.38.0
- Kiwisolver 1.4.4
- Pillow 9.3.0
- Dateutil 2.8.2
- Six 1.16.0
- Pyparsing 3.0.9
- Packaging 20.0
- FFmpeg 5.1.2
- PyOpenCL 2022.3.1
- Pytools 2022.1.14
- Platformdirs 2.6.2
- Typing-extensions 4.4.0
- Plotly 5.12.0
- Tenacity 8.1.0
- Pandas 1.5.3
- Pytz 2022.7.1
- Numba 0.56.4
- Setuptools 65.6.3
- Llvmlite 0.39.1
Number of Bodies | Data File |
---|---|
8192 | DATA-8192.txt |
16384 | DATA-16384.txt |
32768 | DATA-32768.txt |
65536 | DATA-65536.txt |
131072 | DATA-131072.txt |
262144 | DATA-262144.txt |
Also see data.csv
OpenCL was generally slower than native code (NumPy on the CPU, CUDA on the GPU). Interestingly, the OpenCL CPU implementation was faster than the NumPy CPU implementation when p ≥ 65,536. At that point, it was approximately as fast as the OpenCL GPU implementation, suggesting that the speed may have been limited by the latency of random-access memory.
The CUDA data has a trendline of the form ax^b with b=0.978, meaning that the growth is slightly less than linear. The most likely cause is the relatively large inter-device data-transfer times when the number of bodies is small, which have less of an effect when the compute time is greater. This also explains why NumPy is the fastest framework when p=8,192.
OpenCL on the CPU was the most consistent test group, with a trendline R^2 of 0.999. This is likely because the kernel is compiled (C/C++-style OpenCL code) instead of being interpreted in Python, and no data has to be sent to another device.
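A minimal sketch of how an ax^b trendline like the ones above can be fit, assuming a log-log linear fit with NumPy; the runtimes below are placeholder values, not the project's data:

```python
import numpy as np

bodies = np.array([2**13, 2**14, 2**15, 2**16, 2**17, 2**18], dtype=float)
runtimes = np.array([1.2, 2.3, 4.4, 8.9, 17.5, 35.1])  # placeholder values, not measured data

# log(t) = log(a) + b*log(p): a straight-line fit in log-log space
b, log_a = np.polyfit(np.log(bodies), np.log(runtimes), 1)
a = np.exp(log_a)
print(f"a = {a:.3g}, b = {b:.3f}")

# R^2 of the fitted power law in the original space
pred = a * bodies**b
ss_res = np.sum((runtimes - pred) ** 2)
ss_tot = np.sum((runtimes - runtimes.mean()) ** 2)
print(f"R^2 = {1 - ss_res / ss_tot:.4f}")
```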
When the number of bodies was in the range 16,384 ≤ p ≤ 65,536, some of the test groups were bimodal (reflected in their higher standard errors). The cause might be that some test groups had a higher frequency of collisions, which take longer to calculate (since the initial positions and velocities were randomly chosen). The 8,192 group did not have many collisions since there were fewer particles, while the 131,072 and 262,144 groups had enough collisions that every trial encountered at least one.
Goals: 1) Get video for project display
As I was trying to get a video for the Synopsys presentation, I realised that I had incorrectly calculated the component vectors of the force. I simplified the calculation and updated the code; this should have little effect on the data. I used two models for the video, with the following conditions:
- Colliding galaxies:
G=50000
k=0
E=100000
p=1024
t=1
i/f=500:1
- Orbiting bodies:
G=800
k=0
E=100000
p=4
t=1
i/f=30000:100
Goals: 1) Analyse data, 2) clean up code
For the code used for the data analysis, please see Analysis. I also cleaned up the code for better presentation, moving the kernels to separate files, and added a small fix to force softening.
Goals: 1) Collect data.
Using a separate Python script, I loop through each of the four frameworks, run one hundred trials of each, average the data, and send it to my web server. The script will be run six times, once for each group (2^13, 2^14, 2^15, 2^16, 2^17, and 2^18 bodies). The parameters of the test script force one iteration, no rendering, and only a float timestamp as output. Constants are set to G=3,000, k=0, E=2^-128, t=0.01, and s=0.05. Particle positions and velocities are randomly generated, while mass and charge are set to one.
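A minimal sketch of the data-collection loop described above; the script name, command-line flags, and upload URL are placeholders, not the project's actual code:

```python
import statistics
import subprocess
import urllib.parse
import urllib.request

FRAMEWORKS = ["NumPy", "CuPy", "OpenCL-CPU", "OpenCL-GPU"]  # the four test groups
TRIALS = 100

results = {}
for framework in FRAMEWORKS:
    times = []
    for _ in range(TRIALS):
        # The simulation is assumed to print a single float (its runtime)
        # when rendering is disabled and one iteration is forced.
        out = subprocess.run(
            ["python", "nbody.py", "--framework", framework],  # placeholder CLI
            capture_output=True, text=True, check=True,
        )
        times.append(float(out.stdout.strip()))
    results[framework] = statistics.mean(times)

# Send the averaged results to the web server (placeholder URL).
payload = urllib.parse.urlencode(results).encode()
urllib.request.urlopen("https://example.com/collect", data=payload)
```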
Goals: 1) Run preliminary performance benchmark
Run the program with each starting condition (4,096, 8,192, 16,384, 32,768, 65,536, and 131,072 bodies) once on each device. The output is discarded, as only the program runtime is needed. Constants are set to G=3,000, k=0, E=2^-128, t=0.01, and s=0.05. Particle positions and velocities are randomly generated, while mass and charge are set to one.
Blank cells indicate runs that took longer than 600 seconds.
Number of Particles | GPU (s) | CPU Multithread (s) | CPU Single Thread (s) |
---|---|---|---|
4096 | 1.47 | 0.57 | 48.66 |
8192 | 2.98 | 1.79 | 199.36 |
16384 | 11.83 | 5.86 | |
32768 | 21.35 | 21.16 | |
65536 | 50.17 | 114.79 | |
131072 | 110.26 | | |
This data has no statistical significance since only one trial was performed per condition. However, the trendlines (ax^b) had b ≈ 2 for most groups, as expected for the pairwise force calculation. The "CPU Multi" category grew roughly quadratically, as expected. The "GPU CUDA" category took longer than "CPU Multi" at the start, likely due to inter-device data-transfer delays, but eventually overtook it by 2^17 particles. The "CPU Single" category took far longer than expected: instead of being about 16x slower (as I have a 16-thread CPU), it was around 100x slower. There is not enough time to include the single-threaded CPU in the final experiment.
Goals: 1) Enable pre-compilation of the NumPy and CuPy functions.
The NumPy and CuPy functions need to be compiled before they are run. Since this is done using JIT (just-in-time) compilation, it happens at the moment they are first called. This slows down the simulation and can introduce another variable into the results. By calling the functions once before the timer starts, and therefore compiling them early, the variable can be eliminated. See commit 5a13b73.
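A minimal sketch of the warm-up idea, assuming a Numba-compiled function; the function here is a stand-in, not the project's code:

```python
import time

import numpy as np
from numba import njit

@njit
def step(x):
    return x * 2.0  # stand-in for the real force/acceleration function

data = np.random.rand(1000)

step(data)           # first call triggers JIT compilation (the "warm-up")

start = time.time()  # the timer starts only after compilation has happened
for _ in range(1000):
    step(data)
print(time.time() - start)
```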
Goals: 1) Modify output of simulation to have an option to use the Plotly library
The variable DISPLAY now has three states. "Video" keeps the current behaviour (plotting with matplotlib and combining frames with FFmpeg). "Plot" uses the Plotly library to create an interactive web-browser-based plot (note: the plot is stored in RAM, not written to disk). "Both" makes a video and a plot. The plot option creates a data frame with the Pandas library that is used to form a 3D scatter plot. Plotly is significantly faster than matplotlib. See commit 82de489.
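A minimal sketch of the Plotly path, assuming the frame data has already been flattened into x/y/z columns; the DataFrame layout is illustrative, not the project's exact structure:

```python
import numpy as np
import pandas as pd
import plotly.express as px

# One row per particle for a single frame (random stand-in data).
n = 1024
df = pd.DataFrame({
    "x": np.random.rand(n),
    "y": np.random.rand(n),
    "z": np.random.rand(n),
})

fig = px.scatter_3d(df, x="x", y="y", z="z")
fig.show()  # opens an interactive plot in the web browser
```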
Goals: 1) Add multithreading using NumPy, 2) clean up program code, 3) work on handling collisions
The code now has a better interface, including a progress bar. I also added multithreading for the NumPy (CPU) N-body simulation, using just-in-time (JIT) compilation provided by the Numba library. I cleaned up some of the code to make it more readable. Lastly, I think I finally got collisions to work successfully (a collision occurs if the distance is less than the value of the s variable; collisions are assumed to be completely elastic). See commit 7234bc4.
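A minimal sketch of CPU multithreading with Numba's parallel JIT, in the spirit of the change above; the function and the force expression are illustrative, not the project's getForce code:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def total_ax(x, y, z, m, G):
    """x-component of acceleration on every particle (simplified, gravity only)."""
    n = x.size
    ax = np.zeros(n)
    for i in prange(n):              # the outer loop is split across CPU threads
        for j in range(n):
            if i == j:
                continue
            dx = x[j] - x[i]
            dy = y[j] - y[i]
            dz = z[j] - z[i]
            r2 = dx * dx + dy * dy + dz * dz
            ax[i] += G * m[j] * dx / (r2 * np.sqrt(r2) + 1e-12)
    return ax

x, y, z, m = (np.random.rand(1024) for _ in range(4))
print(total_ax(x, y, z, m, 1.0)[:4])
```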
Goals: 1) Implement the CuPy kernel for force and acceleration calculations.
The kernel now works and provides output to the CPU. End users can set the global FRAMEWORK variable to either "CuPy" or "NumPy", depending on how the simulation should be run. "NumPy" keeps the behaviour as before, while "CuPy" runs the previously defined CUDA kernel. I also made some minor modifications to the output of the program, such as using the agg backend of matplotlib instead of TkAgg for potentially better performance. See commit e124b5a.
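A minimal sketch of the FRAMEWORK switch and the non-interactive backend mentioned above; the function names are placeholders, not the project's actual identifiers:

```python
import matplotlib
matplotlib.use("agg")        # render off-screen; no Tk window required

FRAMEWORK = "CuPy"           # global switch: "CuPy" or "NumPy"

def compute_forces_numpy(particles):
    return particles         # stand-in for the NumPy code path

def compute_forces_cupy(particles):
    return particles         # stand-in for the CUDA-kernel code path

def compute_forces(particles):
    if FRAMEWORK == "CuPy":
        return compute_forces_cupy(particles)
    return compute_forces_numpy(particles)
```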
Goals: 1) Begin programming CUDA kernel
Using the cupy.RawKernel function, I created a CUDA kernel called get_force, written in C++. Once compiled, the kernel acts in the same way as the getForce function (which uses the NumPy library). First, it computes the process ID from the thread and block numbers (stored in the tid variable). Then, the constants are redefined (as the Python code is not executed on GPU subunits). The acceleration components are calculated in an identical manner to getForce and transmitted back to the CPU via a pointer. See commit f239416.
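A minimal sketch of a cupy.RawKernel set up in the same style described above (the thread ID is computed from the block and thread indices, and the result is written back through an output pointer); the kernel body is a placeholder, not the project's get_force physics:

```python
import cupy as cp
import numpy as np

kernel = cp.RawKernel(r'''
extern "C" __global__
void double_values(const double* x, double* out, const int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;   // global thread ID
    if (tid < n) {
        out[tid] = 2.0 * x[tid];                       // placeholder computation
    }
}
''', 'double_values')

n = 1 << 10
x = cp.random.rand(n, dtype=cp.float64)
out = cp.zeros(n, dtype=cp.float64)

threads = 256
blocks = (n + threads - 1) // threads
kernel((blocks,), (threads,), (x, out, np.int32(n)))   # launch the kernel

print(cp.asnumpy(out[:4]))                             # copy the result back to the CPU
```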
Goals: 1) Fix slow performance of NumPy simulation
I optimized the computation of forces in two ways. First, I switched from np.vectorize to np.frompyfunc. The two are mostly equivalent, except for a couple of features that are exclusive to vectorize (see here). This brought a modest performance improvement, but for a more significant one, I merged the getForce, xcomp, ycomp, and zcomp functions into a single function. This cut the number of loops by a factor of four and reduced the time spent recomputing data that had already been processed (such as the angles). See commit 09904d7.
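A minimal sketch of the merge described above: a single np.frompyfunc ufunc that returns all three acceleration components at once, so the distances and angles are computed only once per pair. The constants and the softening term are illustrative, not the project's exact expressions:

```python
import numpy as np

G = 3000.0
E = 2.0 ** -128      # softening term to avoid division by zero

def accel_components(x1, y1, z1, x2, y2, z2, m2):
    dx, dy, dz = x2 - x1, y2 - y1, z2 - z1
    r2 = dx * dx + dy * dy + dz * dz + E
    a = G * m2 / r2                              # gross acceleration, computed once
    r = np.sqrt(r2)
    return a * dx / r, a * dy / r, a * dz / r    # x, y, and z components together

# nin=7 inputs, nout=3 outputs: one loop produces all three components
accel_ufunc = np.frompyfunc(accel_components, 7, 3)

n = 8
xs, ys, zs = np.random.rand(3, n)
ax, ay, az = accel_ufunc(xs[0], ys[0], zs[0], xs, ys, zs, 1.0)
print(ax.astype(float))
```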
Goals: 1) Test performance of NumPy simulation
I modified the end of the program to store three timestamps: the start of the run, the point when the simulation ends, and the point when processing ends. Then, the runtime is printed to the Python CLI. I ran five trials with each program, setting the initial conditions to the following:
Iterations = 1e4
Frequency = 5e2
Particles = 10
Time constant = 1e-7
From the five trials performed, the runtime with NumPy was 28.566±0.47 seconds, and the runtime without was 1.126±0.005 seconds. The error bars represent twice the standard error of the mean and do not overlap, so the difference is significant. Unexpectedly, the NumPy version was much slower. Something in the new program is not functioning correctly and requires debugging before moving on to adding the CuPy library, which supports CUDA. See commit 242a687.
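A minimal sketch of the three-timestamp timing described above; the variable names are illustrative:

```python
import time

t_start = time.time()

# ... run the simulation loop here ...
t_sim_end = time.time()

# ... run post-processing / rendering here ...
t_proc_end = time.time()

print(f"Simulation: {t_sim_end - t_start:.3f} s")
print(f"Processing: {t_proc_end - t_sim_end:.3f} s")
print(f"Total:      {t_proc_end - t_start:.3f} s")
```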
Goals: 1) Add animation to program.
Today, I defined the create_video function, which takes as input a list of frames, where each frame is itself a list of coordinates, and uses Matplotlib to save a 3D graph to C:\Nbody\files. The frames are all strung together using FFmpeg, and the frame rate is determined by the ratio of iterations to frequency. The images are all removed afterwards, and the video is stored at C:\Nbody\video.mp4. See commit a56a6fe.
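A minimal sketch of the frame-rendering and FFmpeg step, assuming each frame is a list of (x, y, z) coordinates; the file names and FFmpeg flags are assumptions, not the project's exact create_video code:

```python
import os

import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import numpy as np

def create_video(frames, fps=30, outdir=r"C:\Nbody\files"):
    os.makedirs(outdir, exist_ok=True)
    for i, frame in enumerate(frames):               # frame: list of (x, y, z) tuples
        xs, ys, zs = zip(*frame)
        fig = plt.figure()
        ax = fig.add_subplot(projection="3d")
        ax.scatter(xs, ys, zs, s=2)
        fig.savefig(os.path.join(outdir, f"frame{i:05d}.png"))
        plt.close(fig)
    # string the saved frames together with FFmpeg
    os.system(rf'ffmpeg -y -framerate {fps} -i "{outdir}\frame%05d.png" C:\Nbody\video.mp4')

# Example: 10 frames of 100 random particles.
create_video([np.random.rand(100, 3).tolist() for _ in range(10)])
```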
Goals: 1) Rewrite program using NumPy library
I worked for a few hours this morning to try and make some progress with NumPy. The original ParticleIntr function has now been broken up into getForce (to calculate the gross acceleration) and three other functions (to convert it into x, y, and z components). All four are vectorized. The data is stored in eight separate arrays, due to the way np.vectorize() seems to work. The end result appears to parallel the original simulation. The next step is to store the results for the video animation. See commit e795856.
Goals: 1) Rewrite program using NumPy library
I'm continuing to convert the program to NumPy. Today, I worked to define the ParticleIntrNV function, which is converted to a vectorized function ParticleIntr using np.vectorize(). The ParticleIntr function takes an input of two particles, and calculates the forces between them using the distance formula, Newton's Law of Gravitation, Coulomb's Law, and trigonometry. See commit bc5e20b.
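For reference, the combined pairwise force is a standard statement of Newton's Law of Gravitation and Coulomb's Law over the separation from the distance formula, written here with this notebook's G and k symbols (the exact form in the code, including how the softening term enters, may differ):

$$r = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2 + (z_2-z_1)^2}, \qquad F = \frac{G\,m_1 m_2}{r^2} + \frac{k\,q_1 q_2}{r^2}$$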
Goals: 1) Rewrite program using NumPy library
I'm beginning to write the program using the NumPy library. Because NumPy.vectorize and NumPy.frompyfunc require simple mathematical operations, it will be easier to convert the code into a CUDA or OpenCL kernel later. I'm keeping the constants the same as before, and storing particles in a 2D array, with the second dimension storing the eight datapoints of each particle. See commit ab0edfa.
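A minimal sketch of the 2D particle array; which eight values make up the second dimension is an assumption here (position, velocity, mass, and charge), based on the quantities mentioned elsewhere in this notebook, and the column order is illustrative:

```python
import numpy as np

NUM_PARTICLES = 1024

# assumed columns: x, y, z, vx, vy, vz, mass, charge
particles = np.zeros((NUM_PARTICLES, 8))
particles[:, 0:3] = np.random.rand(NUM_PARTICLES, 3)   # random positions
particles[:, 3:6] = np.random.rand(NUM_PARTICLES, 3)   # random velocities
particles[:, 6] = 1.0                                  # mass set to one
particles[:, 7] = 1.0                                  # charge set to one
```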
Goals: 1) Review NumPy tutorials to rewrite program using NumPy library, 2) fill out Friday reflection for 2/12
No changes to the program today, but I worked to learn more about the Python NumPy library, using these sources: 1, 2, 3, 4, 5
Goals: 1) Set up PC and install libraries
I installed Python 3.10.7, CuPy 11.3.0, Nvidia CUDA Toolkit 11.8, FFmpeg, and Matplotlib 3.6.2, plus all necessary dependencies, onto my computer. I updated the code to run on a PC, for example by changing the shell script to parse the data with Matplotlib. The directories also needed to be updated from /Users/rylandgoldman/... to C:\Nbody\..., because of the Windows filesystem. See commit 4f71716.
Goals: 1) Start status update presentation, 2) discuss notebook problems
I did not work on the project today. Instead, I worked on the presentation for my 4th status update. I also discussed changes I could make to my notebook for the next notebook check on 2022-12-06.
Goals: 1) Work more on collisions, 2) Measure performance
Today, I implemented support for collisions in the N-body simulation. The formula I used was p1.vx = (p1vx*(p1.m-p2.m)+2*p2.m*p2vx)/(p1.m+p2.m). After calculating the collision, the particles are pushed apart until they no longer overlap. I also changed the calculation of the angle alpha in case r is sufficiently close to zero to cause a ValueError. See commit fc172b5.
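A minimal sketch of that 1D elastic-collision update applied to one velocity component of both particles; the symmetric formula for the second particle is the standard result, not copied from the project code:

```python
def elastic_collision_1d(m1, v1, m2, v2):
    """Post-collision velocities for one component, perfectly elastic case."""
    v1_new = (v1 * (m1 - m2) + 2.0 * m2 * v2) / (m1 + m2)
    v2_new = (v2 * (m2 - m1) + 2.0 * m1 * v1) / (m1 + m2)
    return v1_new, v2_new

# Equal masses simply exchange velocities:
print(elastic_collision_1d(1.0, 3.0, 1.0, -1.0))   # -> (-1.0, 3.0)
```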
I also tested the performance of a simulation with 1e5 iterations, 1e3 frequency, and 20 randomly generated particles. The compute times were 68.468, 67.083, 68.791, 66.789, and 66.561 seconds, with a standard deviation of 1.019 and a standard error of the mean of 0.456.
Goals: 1) Fill out Friday Reflection for 17/11, 2) complete journal write up #4
Today, I worked on my Journal Write Up 4 and filled out this week's Friday Reflection.