FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.


Update History

v1.0 (04.08.2022) changes (public release)
- public release
v1.1 (29.09.2022) changes (GPU voxelization)
- added solid voxelization on GPU (slow algorithm)
- added tool to print current camera position (key G)
- minor bug fix (workaround for Intel iGPU driver bug with triangle rendering)
v1.2 (24.10.2022) changes (force/torque computation)
- added functions to compute force/torque on objects
- added function to translate Mesh
- added Stokes drag validation setup
v1.3 (10.11.2022) changes (minor bug fixes)
- added unit conversion functions for torque
- FORCE_FIELD and VOLUME_FORCE can now be used independently
- minor bug fix (workaround for AMD legacy driver bug with binary number literals)
v1.4 (14.12.2022) changes (Linux graphics)
- complete rewrite of C++ graphics library to minimize API dependencies
- added interactive graphics mode on Linux with X11
- fixed streamline visualization bug in 2D
v2.0 (09.01.2023) changes (multi-GPU upgrade)
- added (cross-vendor) multi-GPU support on a single node (PC/laptop/server)
v2.1 (15.01.2023) changes (fast voxelization)
- made solid voxelization on GPU lightning fast (new algorithm, from minutes to milliseconds)
v2.2 (20.01.2023) changes (velocity voxelization)
- added option to voxelize moving/rotating geometry on GPU, with automatic velocity initialization for each grid point based on center of rotation, linear velocity and rotational velocity
- cells that are converted from solid->fluid during re-voxelization now have their DDFs properly initialized
- added option to not auto-scale mesh during read_stl(...), with negative size parameter
- added kernel for solid boundary rendering with marching-cubes
v2.3 (30.01.2023) changes (particles)
- added particles with immersed-boundary method (either passive or 2-way-coupled, only supported with single-GPU)
- minor optimization to GPU voxelization algorithm (workgroup threads outside mesh bounding-box return after ray-mesh intersections have been found)
- displayed GPU memory allocation size is now fully accurate
- fixed bug in write_line() function in src/utilities.hpp
- removed .exe file extension for Linux/macOS
v2.4 (11.03.2023) changes (UI improvements)
- added a help menu with key H that shows keyboard/mouse controls, visualization settings and simulation stats
- improvements to keyboard/mouse control (+/- for zoom, mouseclick frees/locks cursor)
- added suggestion of largest possible grid resolution if resolution is set larger than memory allows
- minor optimizations in multi-GPU communication (insignificant performance difference)
- fixed bug in temperature equilibrium function for temperature extension
- fixed erroneous double literal for Intel iGPUs in skybox color functions
- fixed bug in make.sh where multi-GPU device IDs would not get forwarded to the executable
- minor bug fixes in graphics engine (free cursor not centered during rotation, labels in VR mode)
- fixed bug in LBM::voxelize_stl() size parameter standard initialization
v2.5 (11.04.2023) changes (raytracing overhaul)
- implemented light absorption in fluid for raytracing graphics (no performance impact)
- improved raytracing framerate when camera is inside fluid
- fixed skybox pole flickering artifacts
- fixed bug where moving objects during re-voxelization would leave an erroneous trail of solid grid cells behind
v2.6 (16.04.2023) changes (Intel Arc patch)
- patched OpenCL issues of Intel Arc GPUs: now VRAM allocations >4GB are possible and correct VRAM capacity is reported
v2.7 (29.05.2023) changes (visualization upgrade)
- added slice visualization (key 2 / key 3 modes, then switch through slice modes with key T, move slice with keys Q/E)
- made flag wireframe / solid surface visualization kernels toggleable with key 1
- added surface pressure visualization (key 1 when FORCE_FIELD is enabled and lbm.calculate_force_on_boundaries(); is called)
- added binary .vtk export function for meshes with lbm.write_mesh_to_vtk(Mesh* mesh);
- added time_step_multiplicator for integrate_particles() function in PARTICLES extension
- made correction of wrong memory reporting on Intel Arc more robust
- fixed bug in write_file() template functions
- reverted back to separate cl::Context for each OpenCL device, as the shared Context otherwise would allocate extra VRAM on all other unused Nvidia GPUs
- removed Debug and x86 configurations from Visual Studio solution file (one less complication for compiling)
- fixed bug that particles could get too close to walls and get stuck, or leave the fluid phase (added boundary force)
v2.8 (24.06.2023) changes (documentation + polish)
- finally added more documentation
- cleaned up all sample setups in setup.cpp for more beginner-friendliness, and added required extensions in defines.hpp as comments to all setups
- improved loading of composite .stl geometries, by adding an option to omit automatic mesh repositioning, added more functionality to Mesh struct in utilities.hpp
- added uint3 resolution(float3 box_aspect_ratio, uint memory) function to compute simulation box resolution based on box aspect ratio and VRAM occupation in MB
- added bool lbm.graphics.next_frame(...) function to export images for a specified video length in the main_setup compute loop
- added VIS_... macros to ease setting visualization modes in headless graphics mode in lbm.graphics.visualization_modes
- simulation box dimensions are now automatically made equally divisible by domains for multi-GPU simulations
- fixed Info/Warning/Error message formatting for loading files and made Info/Warning/Error message labels colored
- added Ahmed body setup as an example on how body forces and drag coefficient are computed
- added Cessna 172 and Bell 222 setups to showcase loading composite .stl geometries and revoxelization of moving parts
- added optional semi-transparent rendering mode (#define GRAPHICS_TRANSPARENCY 0.7f in defines.hpp)
- fixed flickering of streamline visualization in interactive graphics
- improved smooth positioning of streamlines in slice mode
- fixed bug where mass and massex in SURFACE extension were also allocated in CPU RAM (not required)
- fixed bug in Q-criterion rendering of halo data in multi-GPU mode, reduced gap width between domains
- removed shared memory optimization from mesh voxelization kernel, as it crashes on Nvidia GPUs with new GPU drivers and is incompatible with old OpenCL 1.0 GPUs
- fixed raytracing attenuation color when no surface is at the simulation box walls with periodic boundaries
v2.9 (31.07.2023) changes (multithreading)
- added cross-platform parallel_for implementation in utilities.hpp using std::thread
- significantly (>4x) faster simulation startup with multithreaded geometry initialization and sanity checks
- faster calculate_force_on_object() and calculate_torque_on_object() functions with multithreading
- added total runtime and LBM runtime to lbm.write_status()
- fixed bug in voxelization ray direction for re-voxelizing rotating objects
- fixed bug in Mesh::get_bounding_box_size()
- fixed bug in print_message() function in utilities.hpp
v2.10 (05.11.2023) changes (frustum culling)
- improved rasterization performance via frustum culling when only part of the simulation box is visible
- improved switching between centered/free camera mode
- refactored OpenCL rendering library
- unit conversion factors are now automatically printed in console when units.set_m_kg_s(...) is used
- faster startup time for FluidX3D benchmark
- minor bug fix in voxelize_mesh(...) kernel
- fixed bug in shading(...)
- replaced slow (in multithreading) std::rand() function with standard C99 LCG
- more robust correction of wrong VRAM capacity reporting on Intel Arc GPUs
- fixed some minor compiler warnings
v2.11 (07.12.2023) changes (improved Linux graphics)
- interactive graphics on Linux are now in fullscreen mode too, fully matching Windows
- made CPU/GPU buffer initialization significantly faster with std::fill and enqueueFillBuffer (overall ~8% faster simulation startup)
- added operating system info to OpenCL device driver version printout
- fixed flickering with frustum culling at very small field of view
- fixed bug where rendered/exported frame was not updated when visualization_modes changed
v2.12 (18.01.2024) changes (faster startup)
- ~3x faster source code compiling on Linux using multiple CPU cores if make is installed
- significantly faster simulation initialization (~40% single-GPU, ~15% multi-GPU)
- minor bug fix in Memory_Container::reset() function
v2.13 (11.02.2024) changes (improved .vtk export)
- data in exported .vtk files is now automatically converted to SI units
- ~2x faster .vtk export with multithreading
- added unit conversion functions for TEMPERATURE extension
- fixed graphical artifacts with axis-aligned camera in raytracing
- fixed get_exe_path() for macOS
- fixed X11 multi-monitor issues on Linux
- workaround for Nvidia driver bug: enqueueFillBuffer is broken for large buffers on Nvidia GPUs
- fixed slow numeric drift issues caused by -cl-fast-relaxed-math
- fixed wrong Maximum Allocation Size reporting in LBM::write_status()
- fixed missing scaling of coordinates to SI units in LBM::write_mesh_to_vtk()
v2.14 (03.03.2024) changes (visualization upgrade)
- coloring can now be switched between velocity/density/temperature with key Z
- uniform improved color palettes for velocity/density/temperature visualization
- color scale with automatic unit conversion can now be shown with key H
- slice mode for field visualization now draws fully filled-in slices instead of only lines for velocity vectors
- shading in VIS_FLAG_SURFACE and VIS_PHI_RASTERIZE modes is smoother now
- make.sh now automatically detects operating system and X11 support on Linux and only runs FluidX3D if last compilation was successful
- fixed compiler warnings on Android
- fixed make.sh failing on some systems due to nonstandard interpreter path
- fixed that make would not compile with multiple cores on some systems
v2.15 (09.04.2024) changes (framerate boost)
- eliminated one frame memory copy and one clear frame operation in rendering chain, for 20-70% higher framerate on both Windows and Linux
- enabled g++ compiler optimizations for faster startup and higher rendering framerate
- fixed bug in multithreaded sanity checks
- fixed wrong unit conversion for thermal expansion coefficient
- fixed density to pressure conversion in LBM units
- fixed bug that raytracing kernel could lock up simulation
- fixed minor visual artifacts with raytracing
- fixed that console sometimes was not cleared before INTERACTIVE_GRAPHICS_ASCII rendering starts
v2.16 (02.05.2024) changes (bug fixes)
- simplified, 10% faster marching-cubes implementation with 1D interpolation on edges instead of 3D interpolation, which eliminates the edge table
- added faster, simplified marching-cubes variant for solid surface rendering where edges are always halfway between grid cells
- refactoring in OpenCL rendering kernels
- fixed that voxelization failed in Intel OpenCL CPU Runtime due to array out-of-bounds access
- fixed that voxelization did not always produce binary identical results in multi-GPU compared to single-GPU
- fixed that velocity voxelization failed for free surface simulations
- fixed terrible performance on ARM GPUs by macro-replacing fused-multiply-add (fma) with a*b+c
- fixed that Y/Z keys were incorrect for QWERTY keyboard layout in Linux
- fixed that free camera movement speed in help overlay was not updated in stationary image when scrolling
- fixed that cursor would sometimes flicker when scrolling on trackpads with Linux-X11 interactive graphics
- fixed flickering of interactive rendering with multi-GPU when camera is not moved
- fixed missing XInitThreads() call that could crash Linux interactive graphics on some systems
- fixed z-fighting between graphics_rasterize_phi() and graphics_flags_mc() kernels
v2.17 (05.06.2024) changes (unlimited domain resolution)
- domains are no longer limited to 4.29 billion (2³², 1624³) grid cells or 225 GB memory; if more are used, the OpenCL code will automatically compile with 64-bit indexing
- new, faster raytracing-based field visualization for single-GPU simulations
- added GPU Driver and OpenCL Runtime installation instructions to documentation
- refactored INTERACTIVE_GRAPHICS_ASCII
- fixed memory leak in destructors of floatN, floatNxN, doubleN, doubleNxN (all unused)
- made camera movement/rotation/zoom behavior independent of framerate
- fixed that smart_device_selection() would print a wrong warning if device reports 0 MHz clock speed
v2.18 (21.07.2024) changes (more bug fixes)
- added support for high refresh rate monitors on Linux
- more compact OpenCL Runtime installation scripts in Documentation
- driver/runtime installation instructions will now be printed to console if no OpenCL devices are available
- added domain information to LBM::write_status()
- added LBM::index function for uint3 input parameter
- fixed that very large simulations sometimes wouldn't render properly by increasing maximum render distance from 10k to 2.1M
- fixed mouse input stuttering at high screen refresh rate on Linux
- fixed graphical artifacts in free surface raytracing on Intel CPU Runtime for OpenCL
- fixed runtime estimation printed in console for setups with multiple lbm.run(...) calls
- fixed density oscillations in sample setups (too large lbm_u)
- fixed minor graphical artifacts in raytrace_phi()
- fixed minor graphical artifacts in ray_grid_traverse_sum()
- fixed wrong printed time step count on raindrop sample setup
v2.19 (07.09.2024) changes (camera splines)
- the camera can now fly along a smooth path through a list of provided keyframe camera placements, using Catmull-Rom splines
- more accurate remaining runtime estimation that includes time spent on rendering
- enabled FP16S memory compression by default
- printed camera placement using key G is now formatted for easier copy/paste
- added benchmark chart in Readme using mermaid gantt chart
- placed memory allocation info during simulation startup at better location
- fixed threading conflict between INTERACTIVE_GRAPHICS and lbm.graphics.write_frame();
- fixed maximum buffer allocation size limit for AMD GPUs and in Intel CPU Runtime for OpenCL
- fixed wrong Re<Re_max info printout for 2D simulations
- minor fix in bandwidth_bytes_per_cell_device()
v3.0 (16.11.2024) changes (larger CPU/iGPU simulations)
- reduced memory footprint on CPUs and iGPUs from 72 to 55 Bytes/cell (fused OpenCL host+device buffers for rho/u/flags), allowing 31% higher resolution in the same RAM capacity
- faster atomic floating-point addition for PARTICLES extension, both in the hardware-supported path and in the fallback emulation
- hardened calculate_f_eq() against bad user input for D2Q9
- fixed velocity voxelization for overlapping geometry with different velocity
- fixed Remaining Time printout during paused simulation
- fixed CPU/GPU memory printout for CPU/iGPU simulations
v3.1 (08.02.2025) changes (more bug fixes)
- faster enqueueReadBuffer() on modern CPUs with 64-Byte-aligned host_buffer
- hardened ray intersection functions against planar ray edge case
- updated OpenCL headers
- better OpenCL device specs detection using vendor ID and Nvidia compute capability
- better VRAM capacity reporting correction for Intel dGPUs
- improved styling of performance mermaid gantt chart in Readme
- added multi-GPU performance mermaid gantt chart in Readme
- updated driver install guides
- fixed voxelization being broken on some GPUs
- added workaround for compiler bug in Intel CPU Runtime for OpenCL that causes Q-criterion isosurface rendering corruption
- fixed TFlops estimate for Intel Battlemage GPUs
- fixed wrong device name reporting for AMD GPUs
v3.2 (09.03.2025) changes (fast force/torque summation)
- implemented GPU-accelerated force/torque summation (~20x faster than CPU-multithreaded implementation before)
- simplified calculating object force/torque in setups
- improved coloring in VIS_FIELD/ray_grid_traverse_sum()
- updated OpenCL-Wrapper now compiles OpenCL C code with -cl-std=CL3.0 if available
- fixed compiling on macOS with new OpenCL headers
v3.3 (17.05.2025) changes (faster .vtk export)
- .vtk export now converts and writes data in chunks, to reduce memory footprint and time for large memory allocation
- .vtk files now contain original file name as metadata in title
- INTERACTIVE_GRAPHICS_ASCII now renders in 2x vertical resolution but with fewer colors
- updated OpenCL-Wrapper: more robust dp4a detection, fixed core count reporting for RDNA4 GPUs
- fixed update_moving_boundaries() kernel not being called with flags other than TYPE_S
- fixed corrupted first frame until resizing with INTERACTIVE_GRAPHICS_ASCII
- fixed resolution() function for D2Q9
- fixed missing <chrono> header on some compilers
- fixed bug in split_regex()
- fixed compiler warning with min_int
v3.4 (02.07.2025) changes (bug fixes)
- updated OpenCL driver install versions
- minor refactoring in stream_collide()
- fixed bug in insertion-sort in voxelize_mesh() kernel causing crash on AMD GPUs
- fixed bug in voxelize_mesh_on_device() host code causing initialization corruption on AMD GPUs
- fixed dual CU and IPC reporting on AMD RDNA 1-4 GPUs
v3.5 (01.10.2025) changes (multi-GPU particles)
- PARTICLES extension now also works with multi-GPU
- faster force spreading if volume force is axis-aligned
- added more documentation for boundary conditions
- updated FAQs
- improved "hydraulic jump" sample setup
- updated GPU driver install instructions
- disabled zero-copy on ARM iGPUs because CL_MEM_USE_HOST_PTR is broken there

How to get started?

Read the FluidX3D Documentation!

Compute Features - Getting the Memory Problem under Control

CFD model: lattice Boltzmann method (LBM)

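streaming (part 2/2)
$$f_0^{\text{temp}}(x,t) = f_0(x,t)$$
$$f_i^{\text{temp}}(x,t) = f_{(t\%2\ ?\ i\ :\ (i\%2\ ?\ i+1\ :\ i-1))}(i\%2\ ?\ x\ :\ x-e_i,\ t) \quad \text{for}\ i \in [1, q-1]$$
collision
$$\rho(x,t) = \left(\sum_i f_i^{\text{temp}}(x,t)\right) + 1$$
$$u(x,t) = \frac{1}{\rho(x,t)} \sum_i c_i\, f_i^{\text{temp}}(x,t)$$
$$f_i^{\text{eq-shifted}}(x,t) = w_i\,\rho \cdot \left(\frac{(u \cdot c_i)^2}{2c^4} - \frac{u \cdot u}{2c^2} + \frac{u \cdot c_i}{c^2}\right) + w_i\,(\rho-1)$$
$$f_i^{\text{temp}}(x,t+\Delta t) = f_i^{\text{temp}}(x,t) + \Omega_i(f_i^{\text{temp}}(x,t),\ f_i^{\text{eq-shifted}}(x,t),\ \tau)$$
streaming (part 1/2)
$$f_0(x,t+\Delta t) = f_0^{\text{temp}}(x,t+\Delta t)$$
$$f_{(t\%2\ ?\ (i\%2\ ?\ i+1\ :\ i-1)\ :\ i)}(i\%2\ ?\ x+e_i\ :\ x,\ t+\Delta t) = f_i^{\text{temp}}(x,t+\Delta t) \quad \text{for}\ i \in [1, q-1]$$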
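The collision step above can be sketched in a few lines of C++. This is a minimal illustration, not FluidX3D's kernel code: it assumes D2Q9 in LBM units (c² = 1/3), a direction ordering chosen for this sketch, and the standard SRT/BGK form Ω_i = -(Δt/τ)(f_i - f_i^eq-shifted) with Δt = 1. The names `moments`, `f_eq_shifted` and `collide_srt` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// D2Q9 velocity set in LBM units; the ordering of directions is an assumption of this sketch.
constexpr int q = 9;
constexpr double c2 = 1.0 / 3.0; // squared lattice speed of sound c^2
constexpr int ex[q] = {0, 1, -1, 0, 0, 1, -1, 1, -1};
constexpr int ey[q] = {0, 0, 0, 1, -1, 1, -1, -1, 1};
constexpr double w[q] = {4.0 / 9.0, 1.0 / 9.0, 1.0 / 9.0, 1.0 / 9.0, 1.0 / 9.0,
                         1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0};

struct Moments { double rho, ux, uy; };

// rho = (sum_i f_i) + 1 and u = (1/rho) sum_i c_i f_i for shifted DDFs
Moments moments(const std::array<double, q>& f) {
    double sum = 0.0, jx = 0.0, jy = 0.0;
    for (int i = 0; i < q; i++) {
        sum += f[i];
        jx += ex[i] * f[i];
        jy += ey[i] * f[i];
    }
    const double rho = sum + 1.0;
    return {rho, jx / rho, jy / rho};
}

// shifted equilibrium DDFs: w_i rho ((u.c_i)^2/(2c^4) - (u.u)/(2c^2) + (u.c_i)/c^2) + w_i (rho-1)
std::array<double, q> f_eq_shifted(const Moments& m) {
    std::array<double, q> feq{};
    const double uu = m.ux * m.ux + m.uy * m.uy;
    for (int i = 0; i < q; i++) {
        const double uc = m.ux * ex[i] + m.uy * ey[i];
        feq[i] = w[i] * m.rho * (uc * uc / (2.0 * c2 * c2) - uu / (2.0 * c2) + uc / c2)
               + w[i] * (m.rho - 1.0);
    }
    return feq;
}

// SRT/BGK collision (assumed form): relax each DDF towards equilibrium with rate 1/tau
std::array<double, q> collide_srt(const std::array<double, q>& f, double tau) {
    assert(tau > 0.5); // tau = nu/c^2 + 1/2, so positive viscosity implies tau > 1/2
    const std::array<double, q> feq = f_eq_shifted(moments(f));
    std::array<double, q> out{};
    for (int i = 0; i < q; i++) out[i] = f[i] - (f[i] - feq[i]) / tau;
    return out;
}
```

Because the shifted equilibrium sums to ρ-1 and carries momentum ρu, the collision leaves density and velocity of a cell unchanged and only relaxes the non-equilibrium part of the DDFs.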

variables and notation

variable	SI units	defining equation	description
x	m	x = (x,y,z)^T	3D position in Cartesian coordinates
t	s	-	time
ρ	kg/m³	ρ = (Σ_i f_i) + 1	mass density of fluid
p	kg/(m s²)	p = c² ρ	pressure of fluid
u	m/s	u = (1/ρ) Σ_i c_i f_i	velocity of fluid
ν	m²/s	ν = μ/ρ	kinematic shear viscosity of fluid
μ	kg/(m s)	μ = ρ ν	dynamic viscosity of fluid
f_i	kg/m³	-	shifted density distribution functions (DDFs)
Δx	m	Δx = 1	lattice constant (in LBM units)
Δt	s	Δt = 1	simulation time step (in LBM units)
c	m/s	c = (1/√3) Δx/Δt	lattice speed of sound (in LBM units)
i	1	0 ≤ i < q	LBM streaming direction index
q	1	q ∈ { 9,15,19,27 }	number of LBM streaming directions
e_i	m	D2Q9 / D3Q15/19/27	LBM streaming directions
c_i	m/s	c_i = e_i/Δt	LBM streaming velocities
w_i	1	Σ_i w_i = 1	LBM velocity set weights
Ω_i	kg/m³	SRT or TRT	LBM collision operator
τ	s	τ = ν/c² + Δt/2	LBM relaxation time
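As a small worked example of the defining equations in the table, the following sketch converts between kinematic viscosity and relaxation time in LBM units (Δx = Δt = 1, so c² = 1/3). The function names are hypothetical and not part of FluidX3D's API.

```cpp
#include <cassert>
#include <cmath>

// LBM units: dx = dt = 1, hence c = 1/sqrt(3) and c^2 = 1/3
constexpr double c2 = 1.0 / 3.0;

// tau = nu/c^2 + dt/2
double relaxation_time(double nu) {
    assert(nu > 0.0); // positive viscosity gives tau > 1/2
    return nu / c2 + 0.5;
}

// inverse relation: nu = c^2 (tau - dt/2)
double kinematic_viscosity(double tau) {
    return c2 * (tau - 0.5);
}

// p = c^2 rho
double pressure(double rho) {
    return c2 * rho;
}
```

For instance, ν = 1/6 in LBM units corresponds to τ = 1, and ρ = 1 corresponds to p = 1/3.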

velocity sets: D2Q9, D3Q15, D3Q19 (default), D3Q27
collision operators: single-relaxation-time (SRT/BGK) (default), two-relaxation-time (TRT)
DDF-shifting and other algebraic optimization to minimize round-off error

optimized to minimize VRAM footprint to 1/6 of other LBM codes

traditional LBM (D3Q19) with FP64 requires ~344 Bytes/cell
- 🟧🟧🟧🟧🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟨🟨🟨🟨🟨🟨🟨🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥
  (density 🟧, velocity 🟦, flags 🟨, 2 copies of DDFs 🟩/🟥; each square = 1 Byte)
- allows for 3 Million cells per 1 GB VRAM
FluidX3D (D3Q19) requires only 55 Bytes/cell with Esoteric-Pull+FP16
- 🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩
  (density 🟧, velocity 🟦, flags 🟨, DDFs 🟩; each square = 1 Byte)
- allows for 19 Million cells per 1 GB VRAM
- in-place streaming with Esoteric-Pull: eliminates redundant copy of density distribution functions (DDFs) in memory; almost cuts memory demand in half and slightly increases performance due to implicit bounce-back boundaries; offers optimal memory access patterns for single-cell in-place streaming
- decoupled arithmetic precision (FP32) and memory precision (FP32 or FP16S or FP16C): all arithmetic is done in FP32 for compatibility on all hardware, but DDFs in memory can be compressed to FP16S or FP16C: almost cuts memory demand in half again and almost doubles performance, without impacting overall accuracy for most setups
- only 8 flag bits per lattice point (can be used independently / at the same time)
  - TYPE_S (stationary or moving) solid boundaries
  - TYPE_E equilibrium boundaries (inflow/outflow)
  - TYPE_T temperature boundaries
  - TYPE_F free surface (fluid)
  - TYPE_I free surface (interface)
  - TYPE_G free surface (gas)
  - TYPE_X remaining for custom use or further extensions
  - TYPE_Y remaining for custom use or further extensions
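The per-cell byte counts quoted above follow from simple bookkeeping: one density scalar, three velocity components, the flags, and q DDFs per stored copy. The sketch below reproduces that arithmetic; the function names are hypothetical, and treating 1 GB as 2³⁰ bytes is an assumption made to match the quoted cell counts.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per lattice cell: density (1 scalar) + velocity (3 scalars) + flags + q DDFs per copy.
constexpr std::uint64_t bytes_per_cell(std::uint64_t q, std::uint64_t scalar_bytes,
                                       std::uint64_t flag_bytes, std::uint64_t ddf_bytes,
                                       std::uint64_t ddf_copies) {
    return 4u * scalar_bytes + flag_bytes + q * ddf_bytes * ddf_copies;
}

// Number of whole cells that fit into a given memory budget.
constexpr std::uint64_t cells_in_memory(std::uint64_t memory_bytes, std::uint64_t bytes_cell) {
    return memory_bytes / bytes_cell;
}

// D3Q19 with FP64 everywhere, 8-Byte flags and two DDF copies (as in the bar chart above)
constexpr std::uint64_t traditional_fp64 = bytes_per_cell(19, 8, 8, 8, 2);
// D3Q19 with FP32 density/velocity, 1-Byte flags, one DDF copy (in-place streaming)
constexpr std::uint64_t fluidx3d_fp32 = bytes_per_cell(19, 4, 1, 4, 1);
constexpr std::uint64_t fluidx3d_fp16 = bytes_per_cell(19, 4, 1, 2, 1);
```

With these inputs the three constants evaluate to 344, 93 and 55 Bytes/cell, and a budget of 2³⁰ bytes holds roughly 3 million and 19 million cells for the first and last case.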

large cost saving: comparison of maximum single-GPU grid resolution for D3Q19 LBM

GPU VRAM capacity	1 GB	2 GB	3 GB	4 GB	6 GB	8 GB	10 GB	11 GB	12 GB	16 GB	20 GB	24 GB	32 GB	40 GB	48 GB	64 GB	80 GB	94 GB	128 GB	180 GB	192 GB	256 GB	288 GB	6 TB
approximate GPU price	$25 GT 210	$25 GTX 950	$12 GTX 1060	$50 GT 730	$35 GTX 1060	$70 RX 470	$500 RTX 3080	$240 GTX 1080 Ti	$75 Tesla M40	$75 Instinct MI25	$900 RX 7900 XT	$205 Tesla P40	$600 Instinct MI60	$5500 A100	$2400 RTX 8000	$10k Instinct MI210	$11k A100	>$40k H100 NVL	? GPU Max 1550	~$80k B200	~$15k MI300X	~$25k MI325X	~$25k MI350X	~$80k 2x Xeon 6960P
traditional LBM (FP64)	144³	182³	208³	230³	262³	288³	312³	322³	330³	364³	392³	418³	460³	494³	526³	578³	624³	658³	730³	818³	836³	920³	956³	2654³
FluidX3D (FP32/FP32)	224³	282³	322³	354³	406³	448³	482³	498³	512³	564³	608³	646³	710³	766³	814³	896³	966³	1018³	1130³	1266³	1292³	1422³	1480³	4106³
FluidX3D (FP32/FP16)	266³	336³	384³	424³	484³	534³	574³	594³	610³	672³	724³	770³	848³	912³	970³	1068³	1150³	1214³	1346³	1508³	1540³	1696³	1764³	4892³
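A rough upper bound for entries like those in the table can be estimated from the memory budget and the bytes per cell. The sketch below is an assumption-laden estimate, not the rule used to build the table: it ignores any memory that is not available to the simulation, so it returns slightly larger values than the table lists (for example 269 rather than 266 for 2³⁰ bytes at 55 Bytes/cell). The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Largest N such that a cubic grid of N^3 cells fits into memory_bytes.
std::uint64_t max_cubic_resolution(std::uint64_t memory_bytes, std::uint64_t bytes_per_cell) {
    assert(bytes_per_cell > 0);
    const std::uint64_t cells = memory_bytes / bytes_per_cell;
    std::uint64_t n = 0;
    // integer search avoids floating-point rounding issues of a cube root
    while ((n + 1) * (n + 1) * (n + 1) <= cells) n++;
    return n;
}
```

Because the cell count grows with N³, reducing the bytes per cell by a factor of 6 raises the achievable edge length by only about a factor of 1.8.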

cross-vendor multi-GPU support on a single computer/server

domain decomposition allows pooling VRAM from multiple GPUs for much larger grid resolution
GPUs don't have to be identical, not even from the same vendor - any combination of AMD+Intel+Nvidia GPUs will work - but similar VRAM capacity/bandwidth is recommended

domain communication architecture (simplified)

++   .-----------------------------------------------------------------.   ++
++   |                              GPU 0                              |   ++
++   |                          LBM Domain 0                           |   ++
++   '-----------------------------------------------------------------'   ++
++              |                 selective                /|\             ++
++             \|/               in-VRAM copy               |              ++
++        .-------------------------------------------------------.        ++
++        |               GPU 0 - Transfer Buffer 0               |        ++
++        '-------------------------------------------------------'        ++
!!                            |     PCIe     /|\                           !!
!!                           \|/    copy      |                            !!
@@        .-------------------------.   .-------------------------.        @@
@@        | CPU - Transfer Buffer 0 |   | CPU - Transfer Buffer 1 |        @@
@@        '-------------------------'\ /'-------------------------'        @@
@@                           pointer  X   swap                             @@
@@        .-------------------------./ \.-------------------------.        @@
@@        | CPU - Transfer Buffer 1 |   | CPU - Transfer Buffer 0 |        @@
@@        '-------------------------'   '-------------------------'        @@
!!                           /|\    PCIe      |                            !!
!!                            |     copy     \|/                           !!
++        .-------------------------------------------------------.        ++
++        |               GPU 1 - Transfer Buffer 1               |        ++
++        '-------------------------------------------------------'        ++
++             /|\                selective                 |              ++
++              |                in-VRAM copy              \|/             ++
++   .-----------------------------------------------------------------.   ++
++   |                              GPU 1                              |   ++
++   |                          LBM Domain 1                           |   ++
++   '-----------------------------------------------------------------'   ++
##                                    |                                    ##
##                      domain synchronization barrier                     ##
##                                    |                                    ##
||   -------------------------------------------------------------> time   ||
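The data flow in the simplified diagram can be mimicked on the CPU with plain vectors. This is only a conceptual sketch under stated assumptions: two 1D domains with a single halo cell each, `std::vector` standing in for GPU and CPU transfer buffers, and vector assignment standing in for the PCIe copies. The `Domain` struct and `exchange` function are hypothetical and do not reflect FluidX3D's actual classes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for one LBM domain on one GPU.
struct Domain {
    std::vector<float> cells;     // domain 0: halo is the last cell; domain 1: halo is the first cell
    std::vector<float> device_tb; // stands in for the GPU transfer buffer
    std::vector<float> host_tb;   // stands in for the CPU transfer buffer
};

// One communication step between domain 0 (left) and domain 1 (right).
void exchange(Domain& d0, Domain& d1) {
    assert(d0.cells.size() >= 2 && d1.cells.size() >= 2);
    const std::size_t n0 = d0.cells.size();
    // 1. selective in-VRAM copy: pack the boundary layer into the transfer buffer
    d0.device_tb.assign(1, d0.cells[n0 - 2]); // last interior cell of domain 0
    d1.device_tb.assign(1, d1.cells[1]);      // first interior cell of domain 1
    // 2. "PCIe copy" from GPU transfer buffer to CPU transfer buffer
    d0.host_tb = d0.device_tb;
    d1.host_tb = d1.device_tb;
    // 3. pointer swap: std::swap on vectors exchanges their storage without copying elements
    std::swap(d0.host_tb, d1.host_tb);
    // 4. "PCIe copy" from CPU transfer buffer back to GPU transfer buffer
    d0.device_tb = d0.host_tb;
    d1.device_tb = d1.host_tb;
    // 5. selective in-VRAM copy: unpack the received layer into the halo cell
    d0.cells[n0 - 1] = d0.device_tb[0];
    d1.cells[0] = d1.device_tb[0];
}
```

After one call, each domain's halo cell holds the neighbor's boundary value. Swapping pointers on the host side means the exchanged data is never copied a second time in CPU memory.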

domain communication architecture (detailed)

++   .-----------------------------------------------------------------.   ++
++   |                              GPU 0                              |   ++
++   |                          LBM Domain 0                           |   ++
++   '-----------------------------------------------------------------'   ++
++     |  selective in- /|\  |  selective in- /|\  |  selective in- /|\    ++
++    \|/ VRAM copy (X)  |  \|/ VRAM copy (Y)  |  \|/ VRAM copy (Z)  |     ++
++   .---------------------.---------------------.---------------------.   ++
++   |    GPU 0 - TB 0X+   |    GPU 0 - TB 0Y+   |    GPU 0 - TB 0Z+   |   ++
++   |    GPU 0 - TB 0X-   |    GPU 0 - TB 0Y-   |    GPU 0 - TB 0Z-   |   ++
++   '---------------------'---------------------'---------------------'   ++
!!          | PCIe /|\            | PCIe /|\            | PCIe /|\         !!
!!         \|/ copy |            \|/ copy |            \|/ copy |          !!
@@   .---------. .---------.---------. .---------.---------. .---------.   @@
@@   | CPU 0X+ | | CPU 1X- | CPU 0Y+ | | CPU 3Y- | CPU 0Z+ | | CPU 5Z- |   @@
@@   | CPU 0X- | | CPU 2X+ | CPU 0Y- | | CPU 4Y+ | CPU 0Z- | | CPU 6Z+ |   @@
@@   '---------\ /---------'---------\ /---------'---------\ /---------'   @@
@@      pointer X swap (X)    pointer X swap (Y)    pointer X swap (Z)     @@
@@   .---------/ \---------.---------/ \---------.---------/ \---------.   @@
@@   | CPU 1X- | | CPU 0X+ | CPU 3Y- | | CPU 0Y+ | CPU 5Z- | | CPU 0Z+ |   @@
@@   | CPU 2X+ | | CPU 0X- | CPU 4Y+ | | CPU 0Y- | CPU 6Z+ | | CPU 0Z- |   @@
@@   '---------' '---------'---------' '---------'---------' '---------'   @@
!!         /|\ PCIe |            /|\ PCIe |            /|\ PCIe |          !!
!!          | copy \|/            | copy \|/            | copy \|/         !!
++   .--------------------..---------------------..--------------------.   ++
++   |   GPU 1 - TB 1X-   ||    GPU 3 - TB 3Y-   ||   GPU 5 - TB 5Z-   |   ++
++   :====================::=====================::====================:   ++
++   |   GPU 2 - TB 2X+   ||    GPU 4 - TB 4Y+   ||   GPU 6 - TB 6Z+   |   ++
++   '--------------------''---------------------''--------------------'   ++
++    /|\ selective in-  |  /|\ selective in-  |  /|\ selective in-  |     ++
++     |  VRAM copy (X) \|/  |  VRAM copy (Y) \|/  |  VRAM copy (Z) \|/    ++
++   .--------------------..---------------------..--------------------.   ++
++   |        GPU 1       ||        GPU 3        ||        GPU 5       |   ++
++   |    LBM Domain 1    ||    LBM Domain 3     ||    LBM Domain 5    |   ++
++   :====================::=====================::====================:   ++
++   |        GPU 2       ||        GPU 4        ||        GPU 6       |   ++
++   |    LBM Domain 2    ||    LBM Domain 4     ||    LBM Domain 6    |   ++
++   '--------------------''---------------------''--------------------'   ++
##              |                     |                     |              ##
##              |      domain synchronization barriers      |              ##
##              |                     |                     |              ##
||   -------------------------------------------------------------> time   ||

peak performance on GPUs (datacenter/gaming/professional/laptop)
- single-GPU/CPU benchmarks
- multi-GPU benchmarks
powerful model extensions
- boundary types
  - stationary mid-grid bounce-back boundaries (stationary solid boundaries)
  - moving mid-grid bounce-back boundaries (moving solid boundaries)
  - equilibrium boundaries (non-reflective inflow/outflow)
  - temperature boundaries (fixed temperature)
- global force per volume (Guo forcing), can be modified on-the-fly
- local force per volume (force field)
  - optional computation of forces from the fluid on solid boundaries
- state-of-the-art free surface LBM (FSLBM) implementation:
  - volume-of-fluid model
  - fully analytic PLIC for efficient curvature calculation
  - improved mass conservation
  - ultra efficient implementation with only 4 kernels in addition to the stream_collide() kernel
- thermal LBM to simulate thermal convection
  - D3Q7 subgrid for thermal DDFs
  - in-place streaming with Esoteric-Pull for thermal DDFs
  - optional FP16S or FP16C compression for thermal DDFs with DDF-shifting
- Smagorinsky-Lilly subgrid turbulence LES model to keep simulations with very large Reynolds number stable
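  $$\Pi_{\alpha\beta} = \sum_i e_{i\alpha}\, e_{i\beta}\, (f_i - f_i^{\text{eq-shifted}})$$
  $$Q = \sum_{\alpha\beta} \Pi_{\alpha\beta}^2$$
  $$\tau = \frac{1}{2}\left(\tau_0 + \sqrt{\tau_0^2 + \frac{16\sqrt{2}}{3\pi^2}\,\frac{\sqrt{Q}}{\rho}}\right)$$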
- particles with immersed-boundary method (either passive or 2-way-coupled)
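The Smagorinsky-Lilly relaxation time listed above can be evaluated directly once the tensor Π is known. The following is a minimal sketch of that formula only; the `Tensor` alias and the function name are hypothetical, and computing Π from the DDFs is left out.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Tensor = std::array<std::array<double, 3>, 3>; // Pi_ab, hypothetical container choice

// tau = 1/2 (tau0 + sqrt(tau0^2 + 16 sqrt(2) / (3 pi^2) * sqrt(Q) / rho)), with Q = sum_ab Pi_ab^2
double smagorinsky_tau(double tau0, const Tensor& Pi, double rho) {
    assert(rho > 0.0);
    double Q = 0.0;
    for (int a = 0; a < 3; a++) {
        for (int b = 0; b < 3; b++) Q += Pi[a][b] * Pi[a][b];
    }
    const double pi = 3.14159265358979323846;
    const double k = 16.0 * std::sqrt(2.0) / (3.0 * pi * pi);
    return 0.5 * (tau0 + std::sqrt(tau0 * tau0 + k * std::sqrt(Q) / rho));
}
```

For a vanishing Π the result reduces to τ₀, and any non-zero Π increases τ, which adds local viscosity where the non-equilibrium stress is large.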

Solving the Visualization Problem

FluidX3D can do simulations so large that storing the volumetric data for later rendering becomes unmanageable (like 120 GB for a single frame, hundreds of terabytes for a video)
instead, FluidX3D allows rendering raw simulation data directly in VRAM, so no large volumetric files have to be exported to the hard disk (see my technical talk)
the rendering is so fast that it works interactively in real time for both rasterization and raytracing
rasterization and raytracing are done in OpenCL and work on all GPUs, even the ones without RTX/DXR raytracing cores or without any rendering hardware at all (like A100, MI200, ...)
if no monitor is available (like on a remote Linux server), there is an ASCII rendering mode to interactively visualize the simulation in the terminal (even in WSL and/or through SSH)
rendering is fully multi-GPU-parallelized via seamless domain decomposition rasterization
with interactive graphics mode disabled, image resolution can be as large as VRAM allows for (4K/8K/16K and above)
(interactive) visualization modes:
- flag wireframe / solid surface (and force vectors on solid cells or surface pressure if the extension is used)
- velocity field (with slice mode)
- streamlines (with slice mode)
- velocity-colored Q-criterion isosurface
- rasterized free surface with marching-cubes
- raytraced free surface with fast ray-grid traversal and marching-cubes, either 1-4 rays/pixel or 1-10 rays/pixel

Solving the Compatibility Problem

FluidX3D is written in OpenCL, so it runs on all hardware from all vendors (Nvidia, AMD, Intel, ...):
- world's fastest datacenter GPUs: B200, MI300X, H200, H100 (NVL), A100, MI200, MI100, V100(S), GPU Max 1100, ...
- gaming GPUs (desktop/laptop): Nvidia GeForce, AMD Radeon, Intel Arc
- professional/workstation GPUs: Nvidia Quadro, AMD Radeon Pro / FirePro, Intel Arc Pro
- integrated GPUs
- CPUs (requires installation of Intel CPU Runtime for OpenCL)
- Intel Xeon Phi (requires installation of Intel CPU Runtime for OpenCL)
- smartphone ARM GPUs
native cross-vendor multi-GPU implementation
- uses PCIe communication, so no SLI/Crossfire/NVLink/InfinityFabric required
- single-node parallelization, so no MPI installation required
- GPUs don't even have to be from the same vendor, but similar memory capacity and bandwidth are recommended
works on Windows and Linux with C++17, with limited support also for macOS and Android
supports importing and voxelizing triangle meshes from binary .stl files, with fast GPU voxelization
supports exporting volumetric data as binary .vtk files
supports exporting triangle meshes as binary .vtk files
supports exporting rendered images as .png/.qoi/.bmp files; encoding runs in parallel on the CPU while the simulation on GPU can continue without delay

Single-GPU/CPU Benchmarks

Here are performance benchmarks on various hardware in MLUPs/s (million lattice cell updates per second). The settings used for the benchmark are D3Q19 SRT with no extensions enabled (only LBM with implicit mid-grid bounce-back boundaries), and the setup consists of an empty cubic box of sufficient size (typically 256³). Without extensions, a single lattice cell requires:

a memory capacity of 93 (FP32/FP32) or 55 (FP32/FP16) Bytes
a memory bandwidth of 153 (FP32/FP32) or 77 (FP32/FP16) Bytes per time step
363 (FP32/FP32) or 406 (FP32/FP16S) or 1275 (FP32/FP16C) FLOPs per time step (FP32+INT32 operations counted combined)

Consequently, the arithmetic intensity of this implementation is 2.37 (FP32/FP32) or 5.27 (FP32/FP16S) or 16.56 (FP32/FP16C) FLOPs/Byte, so performance is limited only by memory bandwidth. The left 3 columns of the table show the hardware specs as found in the data sheets (theoretical peak FP32 compute performance, memory capacity, theoretical peak memory bandwidth). The right 3 columns show the measured FluidX3D performance for the FP32/FP32, FP32/FP16S and FP32/FP16C floating-point precision settings, with the roofline model efficiency in round brackets, indicating what percentage of the theoretical peak memory bandwidth is used.
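
To make the arithmetic behind these numbers explicit, here is a small C++ sketch that recomputes the arithmetic intensity and the roofline efficiency from the per-cell figures above. The struct and function names are hypothetical helpers for this example, not part of FluidX3D. The conversion assumes that 1 MLUPs/s times 1 Byte per cell equals 1 MB/s of memory traffic.

```cpp
// Per-cell cost model taken from the figures above (D3Q19 SRT, no extensions).
// All names in this sketch are illustrative assumptions.
struct CellCost {
    double bytes_per_step; // memory traffic per cell per time step [Byte]
    double flops_per_step; // FP32+INT32 operations per cell per time step
};

constexpr CellCost FP32_FP32 {153.0, 363.0};
constexpr CellCost FP32_FP16S{77.0, 406.0};
constexpr CellCost FP32_FP16C{77.0, 1275.0};

// Arithmetic intensity in FLOPs/Byte.
constexpr double arithmetic_intensity(CellCost c) {
    return c.flops_per_step / c.bytes_per_step;
}

// Fraction of data-sheet bandwidth used: MLUPs/s * Byte/cell = MB/s,
// divided by 1000 to get GB/s, divided by the peak bandwidth in GB/s.
constexpr double roofline_efficiency(double mlups, CellCost c, double bandwidth_gbs) {
    return mlups * c.bytes_per_step / 1000.0 / bandwidth_gbs;
}

// Machine balance in FLOPs/Byte from data-sheet peak TFLOPs/s and GB/s.
constexpr double machine_balance(double peak_tflops, double bandwidth_gbs) {
    return peak_tflops * 1000.0 / bandwidth_gbs;
}

// In the roofline model a kernel is memory-bound when its arithmetic
// intensity is below the machine balance of the device.
constexpr bool is_memory_bound(CellCost c, double peak_tflops, double bandwidth_gbs) {
    return arithmetic_intensity(c) < machine_balance(peak_tflops, bandwidth_gbs);
}
```

For example, `roofline_efficiency(22867.0, FP32_FP32, 5300.0)` evaluates to about 0.66, which corresponds to the 66% listed for the Instinct MI300X in FP32/FP32 mode.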

If your GPU/CPU is not on the list yet, you can report your benchmarks here.

gantt

title FluidX3D Performance [MLUPs/s] - FP32 arithmetic, (fastest of FP32/FP16S/FP16C) memory storage
dateFormat X
axisFormat %s
%%{
	init: {
		"gantt": {
			'titleTopMargin': 42,
			'topPadding': 70,
			'leftPadding': 260,
			'rightPadding': 5,
			'sectionFontSize': 20,
			'fontSize': 20,
			'barHeight': 20,
			'barGap': 3,
			'numberSectionStyles': 2
		},
		'theme': 'forest',
		'themeVariables': {
			'sectionBkgColor': '#99999999',
			'altSectionBkgColor': '#00000000',
			'titleColor': '#AFAFAF',
			'textColor': '#AFAFAF',
			'taskTextColor': 'black',
			'taskBorderColor': '#487E3A'
		}
	}
}%%

section MI300X
	41327 :crit, 0, 41327
section MI250 (1 GCD)
	9030 :crit, 0, 9030
section MI210
	9547 :crit, 0, 9547
section MI100
	8542 :crit, 0, 8542
section MI60
	5111 :crit, 0, 5111
section MI50 32GB
	8477 :crit, 0, 8477
section Radeon VII
	7778 :crit, 0, 7778
section GPU Max 1100
	6303 :done, 0, 6303
section B200 SXM6 180GB
	55609 : 0, 55609
section H200 SXM5 141GB
	36610 : 0, 36610
section GH200 94GB GPU
	34689 : 0, 34689
section H100 NVL
	32922 : 0, 32922
section H100 SXM5 80GB HBM3
	29561 : 0, 29561
section H100 PCIe 80GB HBM2e
	20624 : 0, 20624
section A100 SXM4 80GB
	18448 : 0, 18448
section A100 PCIe 80GB
	17896 : 0, 17896
section PG506-242/243
	15654 : 0, 15654
section A100 SXM4 40GB
	16013 : 0, 16013
section A100 PCIe 40GB
	16035 : 0, 16035
section CMP 170HX
	12392 : 0, 12392
section A30
	9721 : 0, 9721
section V100 SXM2 32GB
	8947 : 0, 8947
section V100 PCIe 16GB
	10325 : 0, 10325
section GV100
	6641 : 0, 6641
section Titan V
	7253 : 0, 7253
section P100 PCIe 16GB
	5950 : 0, 5950
section P100 PCIe 12GB
	4141 : 0, 4141
section GTX TITAN
	2500 : 0, 2500
section K40m
	1868 : 0, 1868
section K80 (1 GPU)
	1642 : 0, 1642
section K20c
	1507 : 0, 1507

section RX 9070 XT
	6688 :crit, 0, 6688
section RX 9070
	6019 :crit, 0, 6019
section RX 7900 XTX
	7716 :crit, 0, 7716
section PRO W7900
	5939 :crit, 0, 5939
section RX 7900 XT
	5986 :crit, 0, 5986
section RX 7800 XT
	3105 :crit, 0, 3105
section PRO W7800
	4426 :crit, 0, 4426
section RX 7900 GRE
	4570 :crit, 0, 4570
section PRO W7700
	2943 :crit, 0, 2943
section RX 7700 XT
	2828 :crit, 0, 2828
section RX 7600
	2561 :crit, 0, 2561
section PRO W7600
	2287 :crit, 0, 2287
section PRO W7500
	1682 :crit, 0, 1682
section RX 6900 XT
	4227 :crit, 0, 4227
section RX 6800 XT
	4241 :crit, 0, 4241
section PRO W6800
	3361 :crit, 0, 3361
section RX 6700 XT
	2908 :crit, 0, 2908
section RX 6750 GRE 12GB
	2848 :crit, 0, 2848
section RX 6800M
	3213 :crit, 0, 3213
section RX 6700M
	2429 :crit, 0, 2429
section RX 6600
	1839 :crit, 0, 1839
section RX 6500 XT
	1030 :crit, 0, 1030
section RX 5700 XT
	3253 :crit, 0, 3253
section RX 5700
	3167 :crit, 0, 3167
section RX 5600 XT
	2214 :crit, 0, 2214
section RX Vega 64
	3227 :crit, 0, 3227
section RX 590
	1688 :crit, 0, 1688
section RX 580 4GB
	1848 :crit, 0, 1848
section RX 580 2048SP 8GB
	1622 :crit, 0, 1622
section RX 480 8GB
	1908 :crit, 0, 1908
section Pro WX 5100
	1604 :crit, 0, 1604
section R9 Fury X
	2880 :crit, 0, 2880
section R9 Nano
	2761 :crit, 0, 2761
section R9 390X
	2217 :crit, 0, 2217
section R9 290X
	1699 :crit, 0, 1699
section R9 290
	1647 :crit, 0, 1647
section HD 7970
	1563 :crit, 0, 1563
section HD 7870
	868 :crit, 0, 868
section HD 7850
	635 :crit, 0, 635
section HD 5870
	257 :crit, 0, 257
section HD 5770
	132 :crit, 0, 132
section FirePro V5800
	108 :crit, 0, 108
section Arc B580 LE
	5153 :done, 0, 5153
section Arc B570
	4186 :done, 0, 4186
section Arc A770 LE
	4568 :done, 0, 4568
section Arc A750 LE
	4314 :done, 0, 4314
section Arc A580
	3889 :done, 0, 3889
section Arc Pro A40
	985 :done, 0, 985
section Arc A380
	1115 :done, 0, 1115
section RTX PRO 6000 Blackwell
	20841 : 0, 20841
section RTX 5090
	19141 : 0, 19141
section RTX 5080
	10304 : 0, 10304
section RTX 5070
	7238 : 0, 7238
section RTX 4090
	11496 : 0, 11496
section RTX 6000 Ada
	10293 : 0, 10293
section L40S
	7637 : 0, 7637
section L40
	7945 : 0, 7945
section RTX 4080 Super
	8218 : 0, 8218
section RTX 4080
	7933 : 0, 7933
section RTX 4070 Ti Super
	7295 : 0, 7295
section RTX 4090M
	6901 : 0, 6901
section RTX 4070 Super
	5554 : 0, 5554
section RTX 4070
	5016 : 0, 5016
section RTX 4080M
	5114 : 0, 5114
section RTX 4000 Ada
	4221 : 0, 4221
section L4
	2857 : 0, 2857
section RTX 4060
	3124 : 0, 3124
section RTX 4070M
	3092 : 0, 3092
section RTX 2000 Ada
	2526 : 0, 2526
section RTX 3090 Ti
	10956 : 0, 10956
section RTX 3090
	10732 : 0, 10732
section RTX 3080 Ti
	9832 : 0, 9832
section RTX 3080 12GB
	9657 : 0, 9657
section RTX A6000
	8814 : 0, 8814
section RTX A5000
	8617 : 0, 8617
section RTX 3080 10GB
	8118 : 0, 8118
section A40
	6622 : 0, 6622
section RTX 3070 Ti
	6807 : 0, 6807
section A10
	5741 : 0, 5741
section RTX 3080M Ti
	5908 : 0, 5908
section RTX 3070
	5096 : 0, 5096
section RTX 3060 Ti
	5129 : 0, 5129
section RTX A4000
	4945 : 0, 4945
section RTX A5000M
	4461 : 0, 4461
section RTX 3060
	4070 : 0, 4070
section RTX 3060M
	4012 : 0, 4012
section A16 (1 GPU)
	2031 : 0, 2031
section A2
	2051 : 0, 2051
section RTX 3050M Ti
	2341 : 0, 2341
section RTX 3050M
	2339 : 0, 2339
section RTX 3050 6GB
	1898 : 0, 1898
section Titan RTX
	7554 : 0, 7554
section RTX 6000
	6879 : 0, 6879
section RTX 8000 Passive
	5607 : 0, 5607
section RTX 2080 Ti
	6853 : 0, 6853
section RTX 2080 Super
	5284 : 0, 5284
section RTX 5000
	4773 : 0, 4773
section RTX 2080
	4977 : 0, 4977
section RTX 2070 Super
	4893 : 0, 4893
section RTX 2070
	5017 : 0, 5017
section RTX 2060 Super
	5035 : 0, 5035
section RTX 4000
	4584 : 0, 4584
section RTX 2060 KO
	3376 : 0, 3376
section RTX 2060
	3604 : 0, 3604
section GTX 1660 Super
	3551 : 0, 3551
section T4
	2887 : 0, 2887
section GTX 1660 Ti
	3041 : 0, 3041
section GTX 1660
	1992 : 0, 1992
section GTX 1650M 896C
	1858 : 0, 1858
section GTX 1650M 1024C
	1400 : 0, 1400
section T500
	665 : 0, 665
section Titan Xp
	5495 : 0, 5495
section GTX 1080 Ti
	4877 : 0, 4877
section GTX 1080
	3182 : 0, 3182
section GTX 1060 6GB
	1925 : 0, 1925
section GTX 1060M
	1882 : 0, 1882
section GTX 1050M Ti
	1224 : 0, 1224
section P1000
	839 : 0, 839
section GTX TITAN X
	2665 : 0, 2665
section GTX 980 Ti
	2703 : 0, 2703
section GTX 980
	1965 : 0, 1965
section GTX 970
	1721 : 0, 1721
section M4000
	1519 : 0, 1519
section M60 (1 GPU)
	1571 : 0, 1571
section GTX 960M
	872 : 0, 872
section GTX 780 Ti
	2776 : 0, 2776
section GTX 770
	1215 : 0, 1215
section GTX 680 4GB
	1274 : 0, 1274
section GTX 670
	1220 : 0, 1220
section GTX 660
	1146 : 0, 1146
section GTX 660 OEM
	990 : 0, 990
section K2000
	444 : 0, 444
section GT 630 (OEM)
	185 : 0, 185
section GTX 580
	1481 : 0, 1481
section GTX 560 Ti
	895 : 0, 895
section GTX 480
	1264 : 0, 1264
section GTX 280
	462 : 0, 462
section GTX 260
	421 : 0, 421
section FX 5800
	296 : 0, 296
section NVS 290
	9 : 0, 9
section Arise 1020
	6 :active, 0, 6

section M2 Ultra (76-CU, 192GB)
	8769 :active, 0, 8769
section M2 Max (38-CU, 32GB)
	4641 :active, 0, 4641
section M2 Pro (19-CU, 16GB)
	2374 :active, 0, 2374
section M1 Ultra (64-CU, 128GB)
	8418 :active, 0, 8418
section M1 Max (24-CU, 32GB)
	4496 :active, 0, 4496
section M1 Pro (16-CU, 16GB)
	2329 :active, 0, 2329
section M1 (8-CU, 16GB)
	759 :active, 0, 759
section Radeon 8060S (Max+ 395)
	2563 :crit, 0, 2563
section Radeon 780M (Z1 Extreme)
	860 :crit, 0, 860
section Radeon Graphics (7800X3D)
	498 :crit, 0, 498
section Vega 8 (4750G)
	511 :crit, 0, 511
section Vega 8 (3500U)
	288 :crit, 0, 288
section Arc 140V GPU (16GB)
	1282 :done, 0, 1282
section Arc Graphics (Ultra 9 185H)
	724 :done, 0, 724
section Iris Xe Graphics (i7-1265U)
	621 :done, 0, 621
section UHD Xe 32EUs
	245 :done, 0, 245
section UHD 770
	475 :done, 0, 475
section UHD 630
	301 :done, 0, 301
section UHD P630
	288 :done, 0, 288
section HD 5500
	192 :done, 0, 192
section HD 4600
	115 :done, 0, 115
section Orange Pi 5 Mali-G610 MP4
	232 :active, 0, 232
section Samsung Mali-G72 MP18
	230 :active, 0, 230

section 2x EPYC 9754
	5179 :crit, 0, 5179
section 2x EPYC 9654
	4092 :crit, 0, 4092
section 2x EPYC 9554
	2552 :crit, 0, 2552
section 1x EPYC 9124
	772 :crit, 0, 772
section 2x EPYC 7713
	1418 :crit, 0, 1418
section 2x EPYC 7352
	739 :crit, 0, 739
section 2x EPYC 7313
	498 :crit, 0, 498
section 2x EPYC 7302
	784 :crit, 0, 784
section 2x Xeon 6980P
	7875 :done, 0, 7875
section 2x Xeon 6979P
	8135 :done, 0, 8135
section 2x Xeon 6960P
	5477 :done, 0, 5477
section 2x Platinum 8592+
	3135 :done, 0, 3135
section 2x Gold 6548N
	1811 :done, 0, 1811
section 2x CPU Max 9480
	2037 :done, 0, 2037
section 2x Platinum 8480+
	2162 :done, 0, 2162
section 2x Platinum 8470
	2068 :done, 0, 2068
section 2x Gold 6438Y+
	1945 :done, 0, 1945
section 2x Platinum 8380
	1410 :done, 0, 1410
section 2x Platinum 8358
	1285 :done, 0, 1285
section 2x Platinum 8256
	396 :done, 0, 396
section 2x Platinum 8153
	691 :done, 0, 691
section 2x Gold 6248R
	755 :done, 0, 755
section 2x Gold 6128
	254 :done, 0, 254
section Phi 7210
	415 :done, 0, 415
section 4x E5-4620 v4
	460 :done, 0, 460
section 2x E5-2630 v4
	264 :done, 0, 264
section 2x E5-2623 v4
	125 :done, 0, 125
section 2x E5-2680 v3
	304 :done, 0, 304
section GH200 Neoverse-V2
	1323 : 0, 1323
section TR PRO 7995WX
	1715 :crit, 0, 1715
section TR 3970X
	463 :crit, 0, 463
section TR 1950X
	273 :crit, 0, 273
section Ryzen 7900X3D
	521 :crit, 0, 521
section Ryzen 7800X3D
	363 :crit, 0, 363
section Ryzen 5700X3D
	229 :crit, 0, 229
section FX-6100
	22 :crit, 0, 22
section Athlon X2 QL-65
	3 :crit, 0, 3
section Ultra 7 258V
	287 :done, 0, 287
section Ultra 9 185H
	317 :done, 0, 317
section i9-14900K
	490 :done, 0, 490
section i7-13700K
	504 :done, 0, 504
section i7-1265U
	128 :done, 0, 128
section i9-11900KB
	208 :done, 0, 208
section i9-10980XE
	286 :done, 0, 286
section E-2288G
	198 :done, 0, 198
section i7-9700
	103 :done, 0, 103
section i5-9600
	147 :done, 0, 147
section i7-8700K
	152 :done, 0, 152
section E-2176G
	201 :done, 0, 201
section i7-7700HQ
	108 :done, 0, 108
section E3-1240 v5
	141 :done, 0, 141
section i5-5300U
	37 :done, 0, 37
section i7-4770
	104 :done, 0, 104
section i7-4720HQ
	80 :done, 0, 80
section N2807
	7 :done, 0, 7

Single-GPU/CPU Benchmark Table

Colors: 🔴 AMD, 🔵 Intel, 🟢 Nvidia, ⚪ Apple, 🟡 ARM, 🟤 Glenfly

Device	FP32 [TFlops/s]	Mem [GB]	BW [GB/s]	FP32/FP32 [MLUPs/s]	FP32/FP16S [MLUPs/s]	FP32/FP16C [MLUPs/s]

🔴 Instinct MI300X	163.40	192	5300	22867 (66%)	41327 (60%)	31670 (46%)
🔴 Instinct MI250 (1 GCD)	45.26	64	1638	5638 (53%)	9030 (42%)	8506 (40%)
🔴 Instinct MI210	45.26	64	1638	6517 (61%)	9547 (45%)	8829 (41%)
🔴 Instinct MI100	46.14	32	1228	5093 (63%)	8133 (51%)	8542 (54%)
🔴 Instinct MI60	14.75	32	1024	3570 (53%)	5047 (38%)	5111 (38%)
🔴 Instinct MI50 32GB	13.25	32	1024	4446 (66%)	8477 (64%)	4406 (33%)
🔴 Radeon VII	13.83	16	1024	4898 (73%)	7778 (58%)	5256 (40%)
🔵 Data Center GPU Max 1100	22.22	48	1229	3769 (47%)	6303 (39%)	3520 (22%)
🟢 B200 SXM6 180GB	74.45	180	8000	42152 (81%)	55609 (54%)	22695 (22%)
🟢 H200 SXM5 141GB	66.91	141	4800	23056 (73%)	36610 (59%)	20291 (33%)
🟢 GH200 94GB GPU	66.91	94	4000	20595 (79%)	34689 (67%)	19407 (37%)
🟢 H100 NVL	60.32	94	3938	20303 (79%)	32922 (64%)	18424 (36%)
🟢 H100 SXM5 80GB HBM3	66.91	80	3350	17602 (80%)	29561 (68%)	20227 (46%)
🟢 H100 PCIe 80GB HBM2e	51.01	80	2000	11128 (85%)	20624 (79%)	13862 (53%)
🟢 A100 SXM4 80GB	19.49	80	2039	10228 (77%)	18448 (70%)	11197 (42%)
🟢 A100 PCIe 80GB	19.49	80	1935	9657 (76%)	17896 (71%)	10817 (43%)
🟢 PG506-243 / PG506-242	22.14	64	1638	8195 (77%)	15654 (74%)	12271 (58%)
🟢 A100 SXM4 40GB	19.49	40	1555	8522 (84%)	16013 (79%)	11251 (56%)
🟢 A100 PCIe 40GB	19.49	40	1555	8526 (84%)	16035 (79%)	11088 (55%)
🟢 CMP 170HX	6.32	8	1493	7684 (79%)	12392 (64%)	6859 (35%)
🟢 A30	10.32	24	933	5004 (82%)	9721 (80%)	5726 (47%)
🟢 Tesla V100 SXM2 32GB	15.67	32	900	4471 (76%)	8947 (77%)	7217 (62%)
🟢 Tesla V100 PCIe 16GB	14.13	16	900	5128 (87%)	10325 (88%)	7683 (66%)
🟢 Quadro GV100	16.66	32	870	3442 (61%)	6641 (59%)	5863 (52%)
🟢 Titan V	14.90	12	653	3601 (84%)	7253 (86%)	6957 (82%)
🟢 Tesla P100 16GB	9.52	16	732	3295 (69%)	5950 (63%)	4176 (44%)
🟢 Tesla P100 12GB	9.52	12	549	2427 (68%)	4141 (58%)	3999 (56%)
🟢 GeForce GTX TITAN	4.71	6	288	1460 (77%)	2500 (67%)	1113 (30%)
🟢 Tesla K40m	4.29	12	288	1131 (60%)	1868 (50%)	912 (24%)
🟢 Tesla K80 (1 GPU)	4.11	12	240	916 (58%)	1642 (53%)	943 (30%)
🟢 Tesla K20c	3.52	5	208	861 (63%)	1507 (56%)	720 (27%)

🔴 Radeon RX 9070 XT	48.66	16	640	3089 (74%)	6688 (80%)	6090 (73%)
🔴 Radeon RX 9070	36.13	16	640	3007 (72%)	5746 (69%)	6019 (72%)
🔴 Radeon RX 7900 XTX	61.44	24	960	3665 (58%)	7644 (61%)	7716 (62%)
🔴 Radeon PRO W7900	61.30	48	864	3107 (55%)	5939 (53%)	5780 (52%)
🔴 Radeon RX 7900 XT	51.61	20	800	3013 (58%)	5856 (56%)	5986 (58%)
🔴 Radeon RX 7800 XT	37.32	16	624	1704 (42%)	3105 (38%)	3061 (38%)
🔴 Radeon PRO W7800	45.20	32	576	1872 (50%)	4426 (59%)	4145 (55%)
🔴 Radeon RX 7900 GRE	42.03	16	576	1996 (53%)	4570 (61%)	4463 (60%)
🔴 Radeon PRO W7700	28.30	16	576	1547 (41%)	2943 (39%)	2899 (39%)
🔴 Radeon RX 7700 XT	30.77	12	432	1405 (50%)	2702 (48%)	2828 (50%)
🔴 Radeon RX 7600	21.75	8	288	1250 (66%)	2561 (68%)	2512 (67%)
🔴 Radeon PRO W7600	20.00	8	288	1179 (63%)	2263 (61%)	2287 (61%)
🔴 Radeon PRO W7500	12.20	8	172	856 (76%)	1630 (73%)	1682 (75%)
🔴 Radeon RX 6900 XT	23.04	16	512	1968 (59%)	4227 (64%)	4207 (63%)
🔴 Radeon RX 6800 XT	20.74	16	512	2008 (60%)	4241 (64%)	4224 (64%)
🔴 Radeon PRO W6800	17.83	32	512	1620 (48%)	3361 (51%)	3180 (48%)
🔴 Radeon RX 6700 XT	13.21	12	384	1408 (56%)	2883 (58%)	2908 (58%)
🔴 Radeon RX 6750 GRE 12GB	12.49	12	384	1298 (52%)	2840 (57%)	2848 (57%)
🔴 Radeon RX 6800M	11.78	12	384	1439 (57%)	3190 (64%)	3213 (64%)
🔴 Radeon RX 6700M	10.60	10	320	1194 (57%)	2388 (57%)	2429 (58%)
🔴 Radeon RX 6600	8.93	8	224	963 (66%)	1817 (62%)	1839 (63%)
🔴 Radeon RX 6500 XT	5.77	4	144	459 (49%)	1011 (54%)	1030 (55%)
🔴 Radeon RX 5700 XT	9.75	8	448	1368 (47%)	3253 (56%)	3049 (52%)
🔴 Radeon RX 5700	7.72	8	448	1521 (52%)	3167 (54%)	2758 (47%)
🔴 Radeon RX 5600 XT	6.73	6	288	1136 (60%)	2214 (59%)	2148 (57%)
🔴 Radeon RX Vega 64	13.35	8	484	1875 (59%)	2878 (46%)	3227 (51%)
🔴 Radeon RX 590	5.53	8	256	1257 (75%)	1573 (47%)	1688 (51%)
🔴 Radeon RX 580 4GB	6.50	4	256	946 (57%)	1848 (56%)	1577 (47%)
🔴 Radeon RX 580 2048SP 8GB	4.94	8	224	868 (59%)	1622 (56%)	1240 (43%)
🔴 Radeon RX 480 8GB	5.83	8	256	1104 (66%)	1908 (57%)	1550 (47%)
🔴 Radeon Pro WX 5100	3.89	8	160	951 (91%)	1604 (77%)	927 (45%)
🔴 Radeon R9 Fury X	8.60	4	512	1628 (49%)	2880 (43%)	2507 (38%)
🔴 Radeon R9 Nano	8.19	4	512	1575 (47%)	2761 (42%)	2224 (33%)
🔴 Radeon R9 390X	5.91	8	384	1733 (69%)	2217 (44%)	1722 (35%)
🔴 Radeon R9 290X	5.63	4	320	1245 (60%)	1699 (41%)	1387 (33%)
🔴 Radeon R9 290	4.85	4	320	1136 (54%)	1647 (40%)	1272 (31%)
🔴 Radeon HD 7970	3.79	3	264	193 (11%)	1270 (37%)	1563 (46%)
🔴 Radeon HD 7870	2.56	2	154	105 (10%)	716 (36%)	868 (44%)
🔴 Radeon HD 7850	1.84	2	154	112 (11%)	120 ( 6%)	635 (32%)
🔴 Radeon HD 5870	2.72	1	154	257 (26%)	111 ( 6%)	141 ( 7%)
🔴 Radeon HD 5770	1.36	1	77	132 (26%)	63 ( 6%)	82 ( 8%)
🔴 FirePro V5800	1.10	1	64	108 (26%)	52 ( 6%)	66 ( 8%)
🔵 Arc B580 LE	14.59	12	456	2549 (86%)	5153 (87%)	4879 (82%)
🔵 Arc B570	12.67	10	380	2078 (84%)	4186 (85%)	4158 (84%)
🔵 Arc A770 LE	19.66	16	560	2663 (73%)	4568 (63%)	4519 (62%)
🔵 Arc A750 LE	17.20	8	512	2555 (76%)	4314 (65%)	4047 (61%)
🔵 Arc A580	12.29	8	512	2534 (76%)	3889 (58%)	3488 (52%)
🔵 Arc Pro A40	5.02	6	192	594 (47%)	985 (40%)	927 (37%)
🔵 Arc A380	4.20	6	186	622 (51%)	1097 (45%)	1115 (46%)
🟢 RTX PRO 6000 Blackwell	125.95	96	1792	10349 (88%)	20511 (88%)	20841 (90%)
🟢 GeForce RTX 5090	104.88	32	1792	9522 (81%)	18459 (79%)	19141 (82%)
🟢 GeForce RTX 5080	56.34	16	960	5174 (82%)	10252 (82%)	10304 (83%)
🟢 GeForce RTX 5070	30.84	12	672	3658 (83%)	7238 (83%)	7107 (81%)
🟢 GeForce RTX 4090	82.58	24	1008	5624 (85%)	11091 (85%)	11496 (88%)
🟢 RTX 6000 Ada	91.10	48	960	4997 (80%)	10249 (82%)	10293 (83%)
🟢 L40S	91.61	48	864	3788 (67%)	7637 (68%)	7617 (68%)
🟢 L40	90.52	48	864	3870 (69%)	7778 (69%)	7945 (71%)
🟢 GeForce RTX 4080 Super	52.22	16	736	4089 (85%)	7660 (80%)	8218 (86%)
🟢 GeForce RTX 4080	55.45	16	717	3914 (84%)	7626 (82%)	7933 (85%)
🟢 GeForce RTX 4070 Ti Super	44.10	16	672	3694 (84%)	6435 (74%)	7295 (84%)
🟢 GeForce RTX 4090M	28.31	16	576	3367 (89%)	6545 (87%)	6901 (92%)
🟢 GeForce RTX 4070 Super	35.55	12	504	2751 (83%)	5149 (79%)	5554 (85%)
🟢 GeForce RTX 4070	29.15	12	504	2646 (80%)	4548 (69%)	5016 (77%)
🟢 GeForce RTX 4080M	33.85	12	432	2577 (91%)	5086 (91%)	5114 (91%)
🟢 RTX 4000 Ada	26.73	20	360	2130 (91%)	3964 (85%)	4221 (90%)
🟢 L4	30.29	24	300	1490 (76%)	2854 (73%)	2857 (73%)
🟢 GeForce RTX 4060	15.11	8	272	1614 (91%)	3052 (86%)	3124 (88%)
🟢 GeForce RTX 4070M	18.25	8	256	1553 (93%)	2945 (89%)	3092 (93%)
🟢 RTX 2000 Ada	12.00	16	224	1351 (92%)	2452 (84%)	2526 (87%)
🟢 GeForce RTX 3090 Ti	40.00	24	1008	5717 (87%)	10956 (84%)	10400 (79%)
🟢 GeForce RTX 3090	39.05	24	936	5418 (89%)	10732 (88%)	10215 (84%)
🟢 GeForce RTX 3080 Ti	37.17	12	912	5202 (87%)	9832 (87%)	9347 (79%)
🟢 GeForce RTX 3080 12GB	32.26	12	912	5071 (85%)	9657 (81%)	8615 (73%)
🟢 RTX A6000	40.00	48	768	4421 (88%)	8814 (88%)	8533 (86%)
🟢 RTX A5000	27.77	24	768	4414 (88%)	8617 (86%)	6971 (70%)
🟢 GeForce RTX 3080 10GB	29.77	10	760	4230 (85%)	8118 (82%)	7714 (78%)
🟢 A40	37.42	48	696	3378 (74%)	6605 (73%)	6622 (73%)
🟢 GeForce RTX 3070 Ti	21.75	8	608	3490 (88%)	6807 (86%)	5926 (75%)
🟢 A10	31.24	24	600	2931 (75%)	5741 (74%)	5597 (72%)
🟢 GeForce RTX 3080M Ti	23.61	16	512	2985 (89%)	5908 (89%)	5780 (87%)
🟢 GeForce RTX 3070	20.31	8	448	2578 (88%)	5096 (88%)	5060 (87%)
🟢 GeForce RTX 3060 Ti	16.49	8	448	2644 (90%)	5129 (88%)	4718 (81%)
🟢 RTX A4000	19.17	16	448	2500 (85%)	4945 (85%)	4664 (80%)
🟢 RTX A5000M	16.59	16	448	2228 (76%)	4461 (77%)	3662 (63%)
🟢 GeForce RTX 3060	13.17	12	360	2108 (90%)	4070 (87%)	3566 (76%)
🟢 GeForce RTX 3060M	10.94	6	336	2019 (92%)	4012 (92%)	3572 (82%)
🟢 A16 (1 GPU)	4.49	16	200	1031 (79%)	2031 (78%)	1133 (44%)
🟢 A2	4.53	15	200	1031 (79%)	2051 (79%)	1199 (46%)
🟢 GeForce RTX 3050M Ti	7.60	4	192	1181 (94%)	2341 (94%)	2253 (90%)
🟢 GeForce RTX 3050M	7.13	4	192	1180 (94%)	2339 (94%)	2016 (81%)
🟢 GeForce RTX 3050 6GB	6.77	6	168	993 (90%)	1898 (87%)	1879 (86%)
🟢 Titan RTX	16.31	24	672	3471 (79%)	7456 (85%)	7554 (87%)
🟢 Quadro RTX 6000	16.31	24	672	3307 (75%)	6836 (78%)	6879 (79%)
🟢 Quadro RTX 8000 Passive	14.93	48	624	2591 (64%)	5408 (67%)	5607 (69%)
🟢 GeForce RTX 2080 Ti	13.45	11	616	3194 (79%)	6700 (84%)	6853 (86%)
🟢 GeForce RTX 2080 Super	11.34	8	496	2434 (75%)	5284 (82%)	5087 (79%)
🟢 Quadro RTX 5000	11.15	16	448	2341 (80%)	4766 (82%)	4773 (82%)
🟢 GeForce RTX 2080	10.07	8	448	2318 (79%)	4977 (86%)	4963 (85%)
🟢 GeForce RTX 2070 Super	9.22	8	448	2255 (77%)	4866 (84%)	4893 (84%)
🟢 GeForce RTX 2070	7.47	8	448	2444 (83%)	4387 (75%)	5017 (86%)
🟢 GeForce RTX 2060 Super	7.18	8	448	2503 (85%)	5035 (87%)	4463 (77%)
🟢 Quadro RTX 4000	7.12	8	416	2284 (84%)	4584 (85%)	4062 (75%)
🟢 GeForce RTX 2060 KO	6.74	6	336	1643 (75%)	3376 (77%)	3266 (75%)
🟢 GeForce RTX 2060	6.74	6	336	1681 (77%)	3604 (83%)	3571 (82%)
🟢 GeForce GTX 1660 Super	5.03	6	336	1696 (77%)	3551 (81%)	3040 (70%)
🟢 Tesla T4	8.14	15	300	1356 (69%)	2869 (74%)	2887 (74%)
🟢 GeForce GTX 1660 Ti	5.48	6	288	1467 (78%)	3041 (81%)	3019 (81%)
🟢 GeForce GTX 1660	5.07	6	192	1016 (81%)	1924 (77%)	1992 (80%)
🟢 GeForce GTX 1650M 896C	2.72	4	192	963 (77%)	1836 (74%)	1858 (75%)
🟢 GeForce GTX 1650M 1024C	3.20	4	128	706 (84%)	1214 (73%)	1400 (84%)
🟢 T500	3.04	4	80	339 (65%)	578 (56%)	665 (64%)
🟢 Titan Xp	12.15	12	548	2919 (82%)	5495 (77%)	5375 (76%)
🟢 GeForce GTX 1080 Ti	12.06	11	484	2631 (83%)	4837 (77%)	4877 (78%)
🟢 GeForce GTX 1080	9.78	8	320	1623 (78%)	3100 (75%)	3182 (77%)
🟢 GeForce GTX 1060 6GB	4.57	6	192	997 (79%)	1925 (77%)	1785 (72%)
🟢 GeForce GTX 1060M	4.44	6	192	983 (78%)	1882 (75%)	1803 (72%)
🟢 GeForce GTX 1050M Ti	2.49	4	112	631 (86%)	1224 (84%)	1115 (77%)
🟢 Quadro P1000	1.89	4	82	426 (79%)	839 (79%)	778 (73%)
🟢 GeForce GTX TITAN X	6.61	12	336	1506 (69%)	2665 (61%)	2518 (58%)
🟢 GeForce GTX 980 Ti	6.05	6	336	1509 (69%)	2703 (62%)	2381 (55%)
🟢 GeForce GTX 980	4.98	4	224	1018 (70%)	1965 (68%)	1872 (64%)
🟢 GeForce GTX 970	4.17	4	224	980 (67%)	1721 (59%)	1623 (56%)
🟢 Quadro M4000	2.57	8	192	899 (72%)	1519 (61%)	1050 (42%)
🟢 Tesla M60 (1 GPU)	4.82	8	160	853 (82%)	1571 (76%)	1557 (75%)
🟢 GeForce GTX 960M	1.51	4	80	442 (84%)	872 (84%)	627 (60%)
🟢 GeForce GTX 780 Ti	5.35	3	336	1710 (78%)	2776 (64%)	1302 (30%)
🟢 GeForce GTX 770	3.33	2	224	800 (55%)	1215 (42%)	876 (30%)
🟢 GeForce GTX 680 4GB	3.33	4	192	783 (62%)	1274 (51%)	814 (33%)
🟢 GeForce GTX 670	2.81	2	192	747 (59%)	1220 (49%)	731 (29%)
🟢 GeForce GTX 660	1.98	2	144	505 (54%)	1146 (61%)	489 (26%)
🟢 GeForce GTX 660 OEM	2.05	1.5	134	556 (63%)	990 (57%)	530 (30%)
🟢 Quadro K2000	0.73	2	64	312 (75%)	444 (53%)	171 (21%)
🟢 GeForce GT 630 (OEM)	0.46	2	29	151 (81%)	185 (50%)	78 (21%)
🟢 GeForce GTX 580	1.58	1.5	192	1028 (82%)	1481 (59%)	599 (24%)
🟢 GeForce GTX 560 Ti	1.26	1	128	661 (79%)	895 (54%)	415 (25%)
🟢 GeForce GTX 480	1.35	1.5	177	908 (78%)	1264 (55%)	509 (22%)
🟢 GeForce GTX 280	0.62	1	142	409 (44%)	462 (25%)	209 (11%)
🟢 GeForce GTX 260	0.48	0.9	111	345 (47%)	421 (29%)	160 (11%)
🟢 Quadro FX 5800	0.62	4	102	296 (44%)	268 (20%)	211 (16%)
🟢 Quadro NVS 290	0.03	0.25	6	9 (22%)	4 ( 5%)	4 ( 5%)
🟤 Arise 1020	1.50	2	19	6 ( 5%)	6 ( 2%)	6 ( 2%)

⚪ M2 Ultra GPU 76CU 192GB	19.46	147	800	4629 (89%)	8769 (84%)	7972 (77%)
⚪ M2 Max GPU 38CU 32GB	9.73	22	400	2405 (92%)	4641 (89%)	2444 (47%)
⚪ M2 Pro GPU 19CU 16GB	4.86	11	200	1195 (91%)	2374 (91%)	2173 (84%)
⚪ M1 Ultra GPU 64CU 128GB	16.38	98	800	4519 (86%)	8418 (81%)	6915 (67%)
⚪ M1 Max GPU 24CU 32GB	6.14	22	400	2369 (91%)	4496 (87%)	2777 (53%)
⚪ M1 Pro GPU 16CU 16GB	4.10	11	200	1204 (92%)	2329 (90%)	1855 (71%)
⚪ M1 GPU 8CU 16GB	2.05	11	68	384 (86%)	758 (85%)	759 (86%)
🔴 Radeon 8060S (Max+ 395)	29.70	15	256	1231 (74%)	2541 (76%)	2563 (77%)
🔴 Radeon 780M (Z1 Extreme)	8.29	8	102	443 (66%)	860 (65%)	820 (62%)
🔴 Radeon Graphics (7800X3D)	0.56	12	102	338 (51%)	498 (37%)	283 (21%)
🔴 Radeon Vega 8 (4750G)	2.15	27	57	263 (71%)	511 (70%)	501 (68%)
🔴 Radeon Vega 8 (3500U)	1.23	7	38	157 (63%)	282 (57%)	288 (58%)
🔵 Arc 140V GPU (16GB)	3.99	16	137	636 (71%)	1282 (72%)	773 (44%)
🔵 Arc Graphics (Ultra 9 185H)	4.81	14	90	271 (46%)	710 (61%)	724 (62%)
🔵 Iris Xe Graphics (i7-1265U)	1.92	13	77	342 (68%)	621 (62%)	574 (58%)
🔵 UHD Graphics Xe 32EUs	0.74	25	51	128 (38%)	245 (37%)	216 (32%)
🔵 UHD Graphics 770	0.82	30	90	342 (58%)	475 (41%)	278 (24%)
🔵 UHD Graphics 630	0.46	7	51	151 (45%)	301 (45%)	187 (28%)
🔵 UHD Graphics P630	0.46	51	42	177 (65%)	288 (53%)	137 (25%)
🔵 HD Graphics 5500	0.35	3	26	75 (45%)	192 (58%)	108 (32%)
🔵 HD Graphics 4600	0.38	2	26	105 (63%)	115 (35%)	34 (10%)
🟡 Mali-G610 MP4 (Orange Pi 5)	0.06	16	34	130 (58%)	232 (52%)	93 (21%)
🟡 Mali-G72 MP18 (Samsung S9+)	0.24	4	29	110 (59%)	230 (62%)	21 ( 6%)

🔴 2x EPYC 9754	50.79	3072	922	3276 (54%)	5077 (42%)	5179 (43%)
🔴 2x EPYC 9654	43.62	1536	922	3087 (51%)	3675 (31%)	4092 (34%)
🔴 2x EPYC 9554	30.72	384	922	2552 (42%)	2127 (18%)	2144 (18%)
🔴 1x EPYC 9124	3.69	128	307	772 (38%)	579 (15%)	586 (15%)
🔴 2x EPYC 7713	8.19	512	410	1298 (48%)	492 ( 9%)	1418 (27%)
🔴 2x EPYC 7352	3.53	512	410	739 (28%)	106 ( 2%)	412 ( 8%)
🔴 2x EPYC 7313	3.07	128	410	498 (19%)	367 ( 7%)	418 ( 8%)
🔴 2x EPYC 7302	3.07	128	410	784 (29%)	336 ( 6%)	411 ( 8%)
🔵 2x Xeon 6980P	98.30	6144	1690	7875 (71%)	5112 (23%)	5610 (26%)
🔵 2x Xeon 6979P	92.16	3072	1690	8135 (74%)	4175 (19%)	4622 (21%)
🔵 2x Xeon 6960P	70.04	2304	1229	5477 (68%)	3387 (21%)	3986 (25%)
🔵 2x Xeon Platinum 8592+	31.13	1024	717	3135 (67%)	2359 (25%)	2466 (26%)
🔵 2x Xeon Gold 6548N	22.94	2048	666	1811 (42%)	1388 (16%)	1425 (16%)
🔵 2x Xeon CPU Max 9480	27.24	256	614	2037 (51%)	1520 (19%)	1464 (18%)
🔵 2x Xeon Platinum 8480+	28.67	512	614	2162 (54%)	1845 (23%)	1884 (24%)
🔵 2x Xeon Platinum 8470	25.29	2048	614	1865 (46%)	1909 (24%)	2068 (26%)
🔵 2x Xeon Gold 6438Y+	16.38	1024	614	1945 (48%)	1219 (15%)	1257 (16%)
🔵 2x Xeon Platinum 8380	23.55	2048	410	1410 (53%)	1159 (22%)	1298 (24%)
🔵 2x Xeon Platinum 8358	21.30	256	410	1285 (48%)	1007 (19%)	1120 (21%)
🔵 2x Xeon Platinum 8256	3.89	1536	282	396 (22%)	158 ( 4%)	175 ( 5%)
🔵 2x Xeon Platinum 8153	8.19	384	256	691 (41%)	290 ( 9%)	328 (10%)
🔵 2x Xeon Gold 6248R	18.43	384	282	755 (41%)	566 (15%)	694 (19%)
🔵 2x Xeon Gold 6128	5.22	192	256	254 (15%)	185 ( 6%)	193 ( 6%)
🔵 Xeon Phi 7210	5.32	192	102	415 (62%)	193 (15%)	223 (17%)
🔵 4x Xeon E5-4620 v4	2.69	512	273	460 (26%)	275 ( 8%)	239 ( 7%)
🔵 2x Xeon E5-2630 v4	1.41	64	137	264 (30%)	146 ( 8%)	129 ( 7%)
🔵 2x Xeon E5-2623 v4	0.67	64	137	125 (14%)	66 ( 4%)	59 ( 3%)
🔵 2x Xeon E5-2680 v3	1.92	128	137	304 (34%)	234 (13%)	291 (16%)
🟢 GH200 Neoverse-V2 CPU	7.88	480	384	1323 (53%)	853 (17%)	683 (14%)
🔴 Threadripper PRO 7995WX	15.36	256	333	1134 (52%)	1697 (39%)	1715 (40%)
🔴 Threadripper 3970X	3.79	128	102	376 (56%)	103 ( 8%)	463 (35%)
🔴 Threadripper 1950X	0.87	128	85	273 (49%)	43 ( 4%)	151 (14%)
🔴 Ryzen 9 7900X3D	1.69	128	83	278 (51%)	521 (48%)	462 (43%)
🔴 Ryzen 7 7800X3D	1.08	32	102	296 (44%)	361 (27%)	363 (27%)
🔴 Ryzen 7 5700X3D	0.87	32	51	229 (68%)	135 (20%)	173 (26%)
🔴 FX-6100	0.16	16	26	11 ( 7%)	11 ( 3%)	22 ( 7%)
🔴 Athlon X2 QL-65	0.03	4	11	3 ( 4%)	2 ( 2%)	3 ( 2%)
🔵 Core Ultra 7 258V	0.56	32	137	287 (32%)	123 ( 7%)	167 ( 9%)
🔵 Core Ultra 9 185H	1.79	16	90	317 (54%)	267 (23%)	288 (25%)
🔵 Core i9-14900K	3.74	32	96	443 (71%)	453 (36%)	490 (39%)
🔵 Core i7-13700K	2.51	64	90	504 (86%)	398 (34%)	424 (36%)
🔵 Core i7-1265U	1.23	32	77	128 (26%)	62 ( 6%)	58 ( 6%)
🔵 Core i9-11900KB	0.84	32	51	109 (33%)	195 (29%)	208 (31%)
🔵 Core i9-10980XE	3.23	128	94	286 (47%)	251 (21%)	223 (18%)
🔵 Xeon E-2288G	0.95	32	43	196 (70%)	182 (33%)	198 (36%)
🔵 Core i7-9700	0.77	64	43	103 (37%)	62 (11%)	95 (17%)
🔵 Core i5-9600	0.60	16	43	146 (52%)	127 (23%)	147 (27%)
🔵 Core i7-8700K	0.71	16	51	152 (45%)	134 (20%)	116 (17%)
🔵 Xeon E-2176G	0.71	64	42	201 (74%)	136 (25%)	148 (27%)
🔵 Core i7-7700HQ	0.36	12	38	81 (32%)	82 (16%)	108 (22%)
🔵 Xeon E3-1240 v5	0.50	32	34	141 (63%)	75 (17%)	88 (20%)
🔵 Core i7-4770	0.44	16	26	104 (62%)	69 (21%)	59 (18%)
🔵 Core i7-4720HQ	0.33	16	26	80 (48%)	23 ( 7%)	60 (18%)
🔵 Celeron N2807	0.01	4	11	7 (10%)	3 ( 2%)	3 ( 2%)

Multi-GPU Benchmarks

Multi-GPU benchmarks are done at the largest possible grid resolution with cubic domains, combining either 2x1x1, 2x2x1 or 2x2x2 of these domains. The percentages in round brackets are single-GPU roofline model efficiency, and the multipliers in round brackets are scaling factors relative to the benchmarked single-GPU performance.
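
The scaling factors described above are plain ratios of multi-GPU to single-GPU throughput. The following C++ sketch shows that calculation. The helper names are hypothetical, and the parallel efficiency is a derived metric added here for illustration; it is not listed in the table.

```cpp
// Scaling factor as printed in round brackets in the multi-GPU table:
// multi-GPU throughput divided by the benchmarked single-GPU throughput.
constexpr double scaling_factor(double multi_mlups, double single_mlups) {
    return multi_mlups / single_mlups;
}

// Number of domains for a (Dx, Dy, Dz) decomposition such as 2x2x2.
constexpr int domain_count(int dx, int dy, int dz) {
    return dx * dy * dz;
}

// Parallel efficiency (derived metric, an assumption of this sketch):
// fraction of ideal linear scaling achieved on n_devices devices.
constexpr double parallel_efficiency(double multi_mlups, double single_mlups, int n_devices) {
    return scaling_factor(multi_mlups, single_mlups) / static_cast<double>(n_devices);
}
```

With the FP32/FP32 values for the Instinct MI300X (152835 MLUPs/s on 8 GPUs, 22867 MLUPs/s on one), `scaling_factor` gives about 6.7, and `parallel_efficiency` over `domain_count(2, 2, 2)` devices gives roughly 0.84.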

gantt

title FluidX3D Performance [MLUPs/s] - FP32 arithmetic, (fastest of FP32/FP16S/FP16C) memory storage
dateFormat X
axisFormat %s
%%{
	init: {
		"gantt": {
			'titleTopMargin': 42,
			'topPadding': 70,
			'leftPadding': 260,
			'rightPadding': 5,
			'sectionFontSize': 20,
			'fontSize': 20,
			'barHeight': 20,
			'barGap': 3,
			'numberSectionStyles': 2
		},
		'theme': 'forest',
		'themeVariables': {
			'sectionBkgColor': '#99999999',
			'altSectionBkgColor': '#00000000',
			'titleColor': '#AFAFAF',
			'textColor': '#AFAFAF',
			'taskTextColor': 'black',
			'taskBorderColor': '#487E3A'
		}
	}
}%%


section 8x MI300X
	204924 :crit, 0, 204924
section 4x MI300X
	109546 :crit, 0, 109546
section 2x MI300X
	61053 :crit, 0, 61053
section 1x MI300X
	41327 :crit, 0, 41327

section 4x MI250 (8 GCD)
	53521 :crit, 0, 53521
section 2x MI250 (4 GCD)
	29627 :crit, 0, 29627
section 1x MI250 (2 GCD)
	17338 :crit, 0, 17338
section 1x MI250 (1 GCD)
	9030 :crit, 0, 9030

section 32x MI210 GigaIO
	50952 :crit, 0, 50952
section 24x MI210 GigaIO
	45033 :crit, 0, 45033
section 16x MI210 GigaIO
	37922 :crit, 0, 37922
section 8x MI210 GigaIO
	27996 :crit, 0, 27996
section 4x MI210 GigaIO
	17232 :crit, 0, 17232
section 2x MI210 GigaIO
	13539 :crit, 0, 13539
section 1x MI210 GigaIO
	9105 :crit, 0, 9105

section 4x MI210
	31408 :crit, 0, 31408
section 2x MI210
	16156 :crit, 0, 16156
section 1x MI210
	8757 :crit, 0, 8757

section 3x MI50 + 1x A100 40GB
	22759 :active,crit, 0, 22759
section 3x MI50 32GB
	21693 :crit, 0, 21693
section 2x MI50 32GB
	14484 :crit, 0, 14484
section 1x MI50 32GB
	8477 :crit, 0, 8477

section 8x Radeon VII
	30826 :crit, 0, 30826
section 4x Radeon VII
	24273 :crit, 0, 24273
section 2x Radeon VII
	15591 :crit, 0, 15591
section 1x Radeon VII
	7778 :crit, 0, 7778

section 1x Radeon Pro Duo (2 GPUs)
	3310 :crit, 0, 3310
section 1x R9 295X2 (2 GPUs)
	2428 :crit, 0, 2428
section 1x HD 7990 (2 GPUs)
	2314 :crit, 0, 2314
section 1x HD 6990 (2 GPUs)
	344 :crit, 0, 344
section 1x HD 5970 (2 GPUs)
	360 :crit, 0, 360

section 4x GPU Max 1100
	22777 :done, 0, 22777
section 2x GPU Max 1100
	11815 :done, 0, 11815
section 1x GPU Max 1100
	6209 :done, 0, 6209

section 8x B200 SXM6 180GB
	219300 : 0, 219300
section 4x B200 SXM6 180GB
	147446 : 0, 147446
section 2x B200 SXM6 180GB
	85077 : 0, 85077
section 1x B200 SXM6 180GB
	55609 : 0, 55609

section 8x H200 SXM5 141GB
	157743 : 0, 157743
section 4x H200 SXM5 141GB
	96056 : 0, 96056
section 2x H200 SXM5 141GB
	57070 : 0, 57070
section 1x H200 SXM5 141GB
	36610 : 0, 36610

section 4x H100 NVL
	82122 : 0, 82122
section 2x H100 NVL
	49958 : 0, 49958
section 1x H100 NVL
	32922 : 0, 32922

section 4x H100 SXM5 80GB HBM3
	78462 : 0, 78462
section 2x H100 SXM5 80GB HBM3
	46189 : 0, 46189
section 1x H100 SXM5 80GB HBM3
	28522 : 0, 28522

section 4x A100 PCIe 80GB
	52056 : 0, 52056
section 2x A100 PCIe 80GB
	27165 : 0, 27165
section 1x A100 PCIe 80GB
	17896 : 0, 17896

section 4x PG506-243/242
	41088 : 0, 41088
section 2x PG506-243/242
	24168 : 0, 24168
section 1x PG506-243/242
	15654 : 0, 15654

section 8x A100 SXM4 40GB
	72965 : 0, 72965
section 4x A100 SXM4 40GB
	42400 : 0, 42400
section 2x A100 SXM4 40GB
	23707 : 0, 23707
section 1x A100 SXM4 40GB
	15917 : 0, 15917

section 4x V100 SXM2 32GB
	26527 : 0, 26527
section 2x V100 SXM2 32GB
	15469 : 0, 15469
section 1x V100 SXM2 32GB
	8947 : 0, 8947

section 3x K40m + 1x Titan Xp
	5174 : 0, 5174
section 2x Tesla K40m
	3300 : 0, 3300
section 1x Tesla K40m
	1868 : 0, 1868

section 1x Tesla K80 (2 GPUs)
	3448 : 0, 3448
section 1x Tesla K80 (1 GPU)
	1642 : 0, 1642

section 2x L40S
	13640 : 0, 13640
section 1x L40S
	7669 : 0, 7669

section 2x L40
	14164 : 0, 14164
section 1x L40
	7945 : 0, 7945

section 8x RTX A6000
	40063 : 0, 40063
section 4x RTX A6000
	27915 : 0, 27915
section 2x RTX A6000
	15026 : 0, 15026
section 1x RTX A6000
	8814 : 0, 8814

section 4x A16 (16 GPUs)
	22451 : 0, 22451
section 2x A16 (8 GPUs)
	11777 : 0, 11777
section 1x A16 (4 GPUs)
	6348 : 0, 6348
section 1x A16 (2 GPUs)
	3475 : 0, 3475
section 1x A16 (1 GPU)
	2031 : 0, 2031

section 2x A2
	3539 : 0, 3539
section 1x A2
	2051 : 0, 2051

section 2x Quadro RTX 8000 Pa.
	10214 : 0, 10214
section 1x Quadro RTX 8000 Pa.
	5607 : 0, 5607

section 7x 2080 Ti + 1x A100 40GB
	33857 : 0, 33857
section 4x GeForce RTX 2080 Ti
	18598 : 0, 18598
section 2x GeForce RTX 2080 Ti
	10922 : 0, 10922
section 1x GeForce RTX 2080 Ti
	6853 : 0, 6853

section 1x GTX 690 (2 GPUs)
	920 : 0, 920

section 2x Arc A770
	8745 :done, 0, 8745
section 1x Arc A770
	4568 :done, 0, 4568

section 1x A100 + 1x P100 + 2x A2 + 3x MI50 + 1x A770
	17296 :active,crit, 0, 17296
section 1x 7700 XT + 1x B580 + 1x Titan Xp
	8358 :active,crit, 0, 8358
section 1x A770 + 1x Titan Xp
	8380 :active,done, 0, 8380

Multi-GPU Benchmark Table

Colors: 🔴 AMD, 🔵 Intel, 🟢 Nvidia, ⚪ Apple, 🟡 ARM, 🟤 Glenfly

Device	FP32 [TFlops/s]	Mem [GB]	BW [GB/s]	FP32/FP32 [MLUPs/s]	FP32/FP16S [MLUPs/s]	FP32/FP16C [MLUPs/s]

🔴 8x Instinct MI300X	1307.20	1536	42400	152835 (6.7x)	192297 (4.7x)	204924 (6.5x)
🔴 4x Instinct MI300X	653.60	768	21200	83678 (3.7x)	103200 (2.5x)	109546 (3.5x)
🔴 2x Instinct MI300X	326.80	384	10600	46673 (2.0x)	61053 (1.5x)	57391 (1.8x)
🔴 1x Instinct MI300X	163.40	192	5300	22867 (66%)	41327 (60%)	31670 (46%)

🔴 4x Instinct MI250 (8 GCD)	362.08	512	13107	27350 (4.9x)	52258 (5.8x)	53521 (6.3x)
🔴 2x Instinct MI250 (4 GCD)	181.04	256	6554	16925 (3.0x)	29163 (3.2x)	29627 (3.5x)
🔴 1x Instinct MI250 (2 GCD)	90.52	128	3277	9460 (1.7x)	14313 (1.6x)	17338 (2.0x)
🔴 1x Instinct MI250 (1 GCD)	45.26	64	1638	5638 (53%)	9030 (42%)	8506 (40%)

🔴 32x Instinct MI210 GigaIO	1448.32	2048	52429	23881 (3.8x)	50952 (6.0x)	48848 (5.4x)
🔴 24x Instinct MI210 GigaIO	1086.24	1536	39322	22056 (3.5x)	45033 (5.3x)	44631 (4.9x)
🔴 16x Instinct MI210 GigaIO	724.16	1024	26214	18094 (2.9x)	37360 (4.4x)	37922 (4.2x)
🔴 8x Instinct MI210 GigaIO	362.08	512	13107	13546 (2.1x)	27996 (3.3x)	27820 (3.1x)
🔴 4x Instinct MI210 GigaIO	181.04	256	6554	8816 (1.4x)	17232 (2.0x)	16892 (1.9x)
🔴 2x Instinct MI210 GigaIO	90.52	128	3277	7245 (1.1x)	12050 (1.4x)	13539 (1.5x)
🔴 1x Instinct MI210 GigaIO	45.26	64	1638	6347 (59%)	8486 (40%)	9105 (43%)

🔴 4x Instinct MI210	181.04	256	6554	17075 (2.6x)	31408 (3.6x)	30643 (3.5x)
🔴 2x Instinct MI210	90.52	128	3277	9624 (1.5x)	15909 (1.8x)	16156 (1.8x)
🔴 1x Instinct MI210	45.26	64	1638	6454 (60%)	8757 (41%)	8751 (41%)

🔴 3x MI50 32GB + 🟢 1x A100 40GB	52.99	128	4096	13159 (3.0x)	22759 (2.7x)	11953 (2.7x)
🔴 3x Instinct MI50 32GB	39.74	96	3072	11709 (2.6x)	21693 (2.6x)	9969 (2.3x)
🔴 2x Instinct MI50 32GB	26.50	64	2048	7803 (1.8x)	14484 (1.7x)	6647 (1.5x)
🔴 1x Instinct MI50 32GB	13.25	32	1024	4446 (66%)	8477 (64%)	4406 (33%)

🔴 8x Radeon VII	110.64	128	8192	21946 (4.5x)	30826 (4.0x)	24572 (4.7x)
🔴 4x Radeon VII	55.32	64	4096	12911 (2.6x)	24273 (3.1x)	17080 (3.2x)
🔴 2x Radeon VII	27.66	32	2048	8113 (1.7x)	15591 (2.0x)	10352 (2.0x)
🔴 1x Radeon VII	13.83	16	1024	4898 (73%)	7778 (58%)	5256 (40%)

🔴 1x Radeon Pro Duo (2 GPUs)	16.38	8	1024	1945 (29%)	3310 (25%)	2879 (22%)
🔴 1x R9 295X2 (2 GPUs)	11.47	8	640	1727 (41%)	2428 (29%)	1989 (24%)
🔴 1x HD 7990 (2 GPUs)	8.19	6	576	1552 (41%)	1996 (27%)	2314 (31%)
🔴 1x HD 6990 (2 GPUs)	5.41	4	320	344 (16%)	214 ( 5%)	255 ( 6%)
🔴 1x HD 5970 (2 GPUs)	3.76	2	256	360 (22%)	181 ( 5%)	224 ( 7%)

🔵 4x DC GPU Max 1100	88.88	192	4915	12162 (3.5x)	22777 (3.7x)	11759 (3.6x)
🔵 2x DC GPU Max 1100	44.44	96	2458	6301 (1.8x)	11815 (1.9x)	5970 (1.8x)
🔵 1x DC GPU Max 1100	22.22	48	1229	3487 (43%)	6209 (39%)	3252 (20%)

🟢 8x B200 SXM6 180GB	595.60	1440	64000	132105 (3.1x)	219300 (3.9x)	128668 (5.7x)
🟢 4x B200 SXM6 180GB	297.80	720	32000	94073 (2.2x)	147446 (2.7x)	72670 (3.2x)
🟢 2x B200 SXM6 180GB	148.90	360	16000	62046 (1.5x)	85077 (1.5x)	39275 (1.7x)
🟢 1x B200 SXM6 180GB	74.45	180	8000	42152 (81%)	55609 (54%)	22695 (22%)

🟢 8x H200 SXM5 141GB	535.28	1128	38400	92008 (4.0x)	157743 (4.3x)	113490 (5.6x)
🟢 4x H200 SXM5 141GB	267.64	564	19200	57310 (2.5x)	96056 (2.6x)	63117 (3.1x)
🟢 2x H200 SXM5 141GB	133.82	282	9600	36711 (1.6x)	57070 (1.6x)	34871 (1.7x)
🟢 1x H200 SXM5 141GB	66.91	141	4800	23056 (73%)	36610 (59%)	20291 (33%)

🟢 4x H100 NVL	241.28	376	15752	44284 (2.2x)	82122 (2.5x)	53855 (2.9x)
🟢 2x H100 NVL	120.64	188	7876	29050 (1.4x)	49958 (1.5x)	30586 (1.7x)
🟢 1x H100 NVL	60.32	94	3938	20303 (79%)	32922 (64%)	18424 (36%)

🟢 4x H100 SXM5 80GB HBM3	267.63	320	13400	46442 (2.7x)	78462 (2.8x)	60490 (3.0x)
🟢 2x H100 SXM5 80GB HBM3	133.82	160	6700	26838 (1.6x)	46189 (1.6x)	34147 (1.7x)
🟢 1x H100 SXM5 80GB HBM3	66.91	80	3350	17262 (79%)	28522 (66%)	20065 (46%)

🟢 4x A100 PCIe 80GB	77.96	320	7740	25957 (2.7x)	52056 (2.9x)	33283 (3.1x)
🟢 2x A100 PCIe 80GB	38.98	160	3870	15742 (1.6x)	27165 (1.5x)	17510 (1.6x)
🟢 1x A100 PCIe 80GB	19.49	80	1935	9657 (76%)	17896 (71%)	10817 (43%)

🟢 4x PG506-243 / PG506-242	88.57	256	6554	23097 (2.8x)	41088 (2.6x)	36130 (2.9x)
🟢 2x PG506-243 / PG506-242	44.28	128	3277	13885 (1.7x)	24168 (1.5x)	20906 (1.7x)
🟢 1x PG506-243 / PG506-242	22.14	64	1638	8195 (77%)	15654 (74%)	12271 (58%)

🟢 8x A100 SXM4 40GB	155.92	320	12440	37619 (4.4x)	72965 (4.6x)	63009 (7.2x)
🟢 4x A100 SXM4 40GB	77.96	160	6220	23411 (2.7x)	42400 (2.7x)	29017 (3.3x)
🟢 2x A100 SXM4 40GB	38.98	80	3110	14311 (1.7x)	23707 (1.5x)	15512 (1.8x)
🟢 1x A100 SXM4 40GB	19.49	40	1555	8543 (84%)	15917 (79%)	8748 (43%)

🟢 4x Tesla V100 SXM2 32GB	62.68	128	3600	13135 (2.9x)	26527 (3.0x)	22686 (3.1x)
🟢 2x Tesla V100 SXM2 32GB	31.34	64	1800	7953 (1.8x)	15469 (1.7x)	12932 (1.8x)
🟢 1x Tesla V100 SXM2 32GB	15.67	32	900	4471 (76%)	8947 (77%)	7217 (62%)

🟢 3x K40m + 1x Titan Xp	17.16	48	1154	3117 (2.8x)	5174 (2.8x)	3127 (3.4x)
🟢 2x Tesla K40m	8.58	24	577	1971 (1.7x)	3300 (1.8x)	1801 (2.0x)
🟢 1x Tesla K40m	4.29	12	288	1131 (60%)	1868 (50%)	912 (24%)

🟢 1x Tesla K80 (2 GPUs)	8.22	24	480	2086 (2.3x)	3448 (2.1x)	2174 (2.3x)
🟢 1x Tesla K80 (1 GPU)	4.11	12	240	916 (58%)	1642 (53%)	943 (30%)

🟢 2x L40S	183.22	96	1728	6888 (1.8x)	13099 (1.8x)	13640 (1.8x)
🟢 1x L40S	91.61	48	864	3824 (68%)	7463 (67%)	7669 (68%)

🟢 2x L40	181.04	96	1728	7137 (1.8x)	13547 (1.7x)	14164 (1.8x)
🟢 1x L40	90.52	48	864	3870 (69%)	7778 (69%)	7945 (71%)

🟢 8x RTX A6000	320.00	384	6144	19311 (4.4x)	40063 (4.5x)	39004 (4.6x)
🟢 4x RTX A6000	160.00	192	3072	14314 (3.2x)	27915 (3.2x)	27227 (3.2x)
🟢 2x RTX A6000	80.00	96	1536	8041 (1.8x)	15026 (1.7x)	14795 (1.7x)
🟢 1x RTX A6000	40.00	48	768	4421 (88%)	8814 (88%)	8533 (86%)

🟢 4x A16 (16 GPUs)	71.84	256	3200	12078 (11.7x)	22451 (11.1x)	14489 (12.8x)
🟢 2x A16 (8 GPUs)	35.92	128	1600	6389 (6.2x)	11777 (5.8x)	7500 (6.6x)
🟢 1x A16 (4 GPUs)	17.96	64	800	3497 (3.4x)	6348 (3.1x)	3919 (3.5x)
🟢 1x A16 (2 GPUs)	8.98	32	400	1912 (1.9x)	3475 (1.7x)	2064 (1.8x)
🟢 1x A16 (1 GPU)	4.49	16	200	1031 (79%)	2031 (78%)	1133 (44%)

🟢 2x A2	9.06	30	400	1927 (1.9x)	3539 (1.7x)	2232 (1.9x)
🟢 1x A2	4.53	15	200	1031 (79%)	2051 (79%)	1199 (46%)

🟢 2x Quadro RTX 8000 Pa.	29.86	96	1248	4767 (1.8x)	9607 (1.8x)	10214 (1.8x)
🟢 1x Quadro RTX 8000 Pa.	14.93	48	624	2591 (64%)	5408 (67%)	5607 (69%)

🟢 7x 2080 Ti + 1x A100 40GB	107.60	88	4928	16146 (5.1x)	33732 (5.0x)	33857 (4.9x)
🟢 4x GeForce RTX 2080 Ti	53.80	44	2464	9117 (2.9x)	18415 (2.7x)	18598 (2.7x)
🟢 2x GeForce RTX 2080 Ti	26.90	22	1232	5085 (1.6x)	10770 (1.6x)	10922 (1.6x)
🟢 1x GeForce RTX 2080 Ti	13.45	11	616	3194 (79%)	6700 (84%)	6853 (86%)

🟢 1x GTX 690 (2 GPUs)	6.26	4	385	587 (23%)	920 (18%)	535 (11%)

🔵 2x Arc A770	39.32	32	1120	4954 (1.9x)	8745 (1.9x)	8329 (1.8x)
🔵 1x Arc A770	19.66	16	560	2663 (73%)	4568 (63%)	4519 (62%)

🟢 1x A100 40GB + 🟢 1x P100 16GB + 🟢 2x A2 + 🔴 3x MI50 32GB + 🔵 1x A770 16GB	54.36	180	2400	9903 (63%)	17296 (55%)	12041 (39%)
🔴 1x 7700 XT + 🔵 1x B580 + 🟢 1x Titan Xp	36.45	36	1296	3979 (47%)	8003 (48%)	8358 (39%)
🔵 1x A770 + 🟢 1x Titan Xp	24.30	24	1095	4717 (66%)	8380 (59%)	8026 (56%)
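
The values in parentheses can be reproduced from the other columns. The following is a minimal sketch, assuming that the percentage is the roofline efficiency (MLUPs/s times bytes moved per cell update, divided by memory bandwidth) and that the factor is the speedup over the 1x row of the same device. The 153 and 77 bytes per cell for FP32/FP32 and FP32/FP16 are assumptions that happen to reproduce the MI300X rows; the function names are illustrative only.

```python
# Assumed memory traffic per cell update in bytes (not stated in this section).
BYTES_PER_CELL = {"FP32/FP32": 153, "FP32/FP16S": 77, "FP32/FP16C": 77}


def roofline_efficiency(mlups, bandwidth_gbs, precision):
    """Fraction of the memory bandwidth that the measured MLUPs/s accounts for."""
    bytes_per_second = mlups * 1e6 * BYTES_PER_CELL[precision]
    return bytes_per_second / (bandwidth_gbs * 1e9)


def scaling_factor(mlups_multi, mlups_single):
    """Speedup of a multi-GPU run over the single-GPU run of the same device."""
    return mlups_multi / mlups_single


if __name__ == "__main__":
    # 1x Instinct MI300X, FP32/FP32: 22867 MLUPs/s at 5300 GB/s
    print(f"{roofline_efficiency(22867, 5300, 'FP32/FP32'):.0%}")  # 66%
    # 8x Instinct MI300X vs. 1x Instinct MI300X, FP32/FP32
    print(f"{scaling_factor(152835, 22867):.1f}x")  # 6.7x
```

A scaling factor below the GPU count reflects the communication overhead between domains.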

FAQs

General

How to learn using FluidX3D?

Follow the FluidX3D Documentation!
FluidX3D only uses FP32 or even FP32/FP16, in contrast to FP64. Are simulation results physically accurate?

Yes, in all but extreme edge cases. The code has been specially optimized to minimize arithmetic round-off errors and make the most out of lower precision. With these optimizations, accuracy in most cases is indistinguishable from FP64 double-precision, even with FP32/FP16 mixed-precision. Details can be found in this paper.
Why don't you use CUDA? Wouldn't that be more efficient?

No, that is a myth. OpenCL is exactly as efficient as CUDA on Nvidia GPUs if optimized properly. Here I did a roofline model analysis of OpenCL performance on various hardware. OpenCL efficiency on modern Nvidia GPUs can be 100% with the right memory access pattern, so CUDA can't possibly be any more efficient. Without any performance advantage, there is no reason to use proprietary CUDA over OpenCL, since OpenCL is compatible with a lot more hardware.
FluidX3D in Blender?

FluidX3D is standalone software with an integrated in-situ rendering engine that directly accesses the raw simulation data in VRAM. Integration into Blender or other software packages is not planned.
Checkpoint support?

FluidX3D does not support saving/loading checkpoints. I have made the software so robust that it does not crash, and so fast that restarting a simulation does not take longer than a few hours at most on a GPU system.

Model

What physical model does FluidX3D use?

FluidX3D implements the lattice Boltzmann method, a type of direct numerical simulation (DNS), the most accurate type of fluid simulation, but also the most computationally challenging. Optional extension models implemented in FluidX3D include volume force (Guo forcing), free surface (volume-of-fluid and PLIC), a temperature model, and a Smagorinsky-Lilly subgrid turbulence model for DNS-LES simulations.
What boundary types are available in FluidX3D?

Periodic boundaries, solid boundaries, equilibrium boundaries, and moving solid boundaries.
See the relevant section in the FluidX3D Documentation!
Does FluidX3D support adaptive mesh refinement?

No, not yet. Grid cell size is the same everywhere in the simulation box.
Can FluidX3D model both water and air at the same time?

No. FluidX3D can model either water or air, but not both at the same time. For free surface simulations with the SURFACE extension, I went with a volume-of-fluid/PLIC modeling approach as that provides a sharp water-air interface, so individual droplets can be resolved as small as 3 grid cells in diameter. However, this model ignores the gas phase completely and only models the fluid phase with LBM, as well as the surface tension. An alternative I explored years ago is phase-field models (the simplest of them being the Shan-Chen model): they model both fluid and gas phases, but struggle with the 1:1000 density contrast of air:water, and the modeled interface is diffuse over ~5 grid cells. So the smallest resolved droplets are ~10 grid cells in diameter, meaning that for the same resolution you need ~37x the memory footprint, which is infeasible on GPUs. Coming back to the VoF model, it is possible to extend it with a model for the gas phase, but one has to manually track bubble split/merge events, which makes this approach very painful to implement and poorly performing on the hardware.
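
The ~37x figure follows from scaling the grid in all three dimensions. A minimal sketch of that arithmetic, assuming memory is proportional to the number of grid cells; the function name is illustrative only.

```python
def memory_factor(diameter_diffuse, diameter_sharp, dimensions=3):
    """Factor by which the cell count grows when the smallest resolvable
    droplet needs more grid cells across its diameter."""
    return (diameter_diffuse / diameter_sharp) ** dimensions


if __name__ == "__main__":
    # ~10 cells (diffuse interface) vs. 3 cells (sharp VoF/PLIC interface)
    print(f"{memory_factor(10, 3):.0f}x")  # 37x
```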
Can FluidX3D compute lift/drag forces?

Yes. See the relevant section in the FluidX3D Documentation!
Can FluidX3D simulate transonic/supersonic flows?

No. The LBM model in FluidX3D works only in the weakly compressible regime at Mach numbers < 0.3.
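
To check a setup against this limit, the Mach number can be estimated from the velocity in lattice units. This is a minimal sketch, assuming the standard lattice speed of sound of 1/sqrt(3) in lattice units; the function names are illustrative and not part of FluidX3D.

```python
import math

# Speed of sound in lattice units for the standard LBM velocity sets.
LATTICE_SPEED_OF_SOUND = 1.0 / math.sqrt(3.0)


def mach_number(lbm_u):
    """Mach number for a characteristic velocity given in lattice units."""
    return lbm_u / LATTICE_SPEED_OF_SOUND


def is_weakly_compressible(lbm_u, limit=0.3):
    """True if the velocity stays below the Mach number limit."""
    return mach_number(lbm_u) < limit


if __name__ == "__main__":
    for u in (0.05, 0.1, 0.2):
        print(u, round(mach_number(u), 3), is_weakly_compressible(u))
```

Under this assumption, a lattice velocity of 0.1 stays within the weakly compressible regime, while 0.2 exceeds the limit.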
How to simulate moving/rotating geometry?

See the section about .stl voxelization in the FluidX3D Documentation.
Can FluidX3D model thermal convection?

Yes. See the relevant section in the FluidX3D Documentation!
Can FluidX3D model chemical reactions?

No.
Can FluidX3D model particles?

Yes, either as passive tracer particles or with 2-way coupling (buoyancy). See the relevant section in the FluidX3D Documentation!
I see white lines or everything disappeared. Why?

This indicates that the simulation has become unstable. See the relevant section in the FluidX3D Documentation!
The density and/or velocity field are oscillatory. Why?

These are either acoustic (standing) waves or non-physical artifacts. Play around with the SI-LBM unit conversion, and reduce the time step size in SI units by reducing lbm_u.
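
The link between lbm_u and the time step can be sketched as follows, assuming that the same characteristic velocity is given once in SI units (si_u, in m/s) and once in lattice units (lbm_u), and that si_dx is the grid cell size in meters. The names si_u and si_dx and the function are hypothetical and chosen only for this illustration.

```python
def time_step_si(lbm_u, si_u, si_dx):
    """Time step in seconds implied by mapping si_u onto lbm_u.

    One lattice velocity unit is one cell per time step, so
    lbm_u = si_u * dt / si_dx, which gives dt = si_dx * lbm_u / si_u.
    """
    return si_dx * lbm_u / si_u


if __name__ == "__main__":
    # Example: 10 m/s flow on a grid with 1 cm cells.
    for lbm_u in (0.1, 0.05, 0.025):
        print(lbm_u, time_step_si(lbm_u, si_u=10.0, si_dx=0.01))
```

Halving lbm_u halves the time step in SI units, so the same physical time span takes twice as many LBM steps.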

Why no multi-relaxation-time (MRT) collision operator?

The idea of MRT is to linearly transform the DDFs into "moment space" by matrix multiplication and relax these moments individually, promising better stability and accuracy. In practice, in the vast majority of cases, it has zero or even negative effects on stability and accuracy, and simple SRT is much superior. Apart from the kinematic shear viscosity and conserved terms, the remaining moments are non-physical quantities and their tuning is a black box. Although MRT can be implemented in an efficient manner with only a single matrix-vector multiplication in registers, leading to identical performance compared to SRT by remaining bandwidth-bound, storing the matrices vastly elongates and over-complicates the code for no real benefit.
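
For comparison, the single-relaxation-time (SRT) operator fits in a few lines. This is a textbook sketch, not the FluidX3D kernel: it assumes lattice units, where the relaxation time follows from the kinematic viscosity as tau = 3*nu + 1/2, and takes the equilibrium DDFs f_eq as given. The function names are illustrative only.

```python
def relaxation_time(nu):
    """SRT relaxation time for a kinematic shear viscosity in lattice units."""
    return 3.0 * nu + 0.5


def srt_collide(f, f_eq, tau):
    """Relax every DDF towards its equilibrium value with the same rate 1/tau."""
    return [fi - (fi - fe) / tau for fi, fe in zip(f, f_eq)]


if __name__ == "__main__":
    tau = relaxation_time(1.0 / 6.0)
    # Arbitrary example values; f and f_eq carry the same total mass.
    f = [0.40, 0.35, 0.25]
    f_eq = [0.30, 0.40, 0.30]
    print(tau, srt_collide(f, f_eq, tau))
```

MRT would replace the single rate 1/tau by one rate per moment, which is where the additional tuning parameters come from.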

Hardware

Can FluidX3D run on multiple GPUs at the same time?

Yes. The simulation grid is then split into domains, one for each GPU (domain decomposition method). The GPUs essentially pool their memory, enabling much larger grid resolution and higher performance. Rendering is parallelized across multiple GPUs as well; each GPU renders its own domain with a 3D offset, then rendered frames from all GPUs are overlaid with their z-buffers. Communication between domains is done over PCIe, so no SLI/Crossfire/NVLink/InfinityFabric is required. All GPUs must however be installed in the same node (PC/laptop/server). Even unholy combinations of AMD+Intel+Nvidia GPUs will work, although it is recommended to only use GPUs with similar memory capacity and bandwidth together. Using a fast gaming GPU and slow integrated GPU together would only decrease performance due to communication overhead.
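
Overlaying frames with their z-buffers can be sketched as a per-pixel depth test. This is a minimal illustration, assuming each GPU delivers a color buffer and a depth buffer of equal size and that a smaller depth value means closer to the camera; the data layout and function name are hypothetical.

```python
import math


def composite(frames):
    """Merge per-GPU frames into one image by keeping, for every pixel,
    the color with the smallest depth.

    frames: list of (colors, depths) pairs, one pair per GPU,
            each a flat list with one entry per pixel.
    """
    colors, depths = list(frames[0][0]), list(frames[0][1])
    for frame_colors, frame_depths in frames[1:]:
        for i, (color, depth) in enumerate(zip(frame_colors, frame_depths)):
            if depth < depths[i]:
                colors[i], depths[i] = color, depth
    return colors, depths


if __name__ == "__main__":
    inf = math.inf  # pixel not covered by any geometry
    gpu0 = ([1, 1, 1], [5.0, 2.0, inf])
    gpu1 = ([2, 2, 2], [3.0, 4.0, inf])
    print(composite([gpu0, gpu1]))  # ([2, 1, 1], [3.0, 2.0, inf])
```

Because the depth test is independent per pixel, the order in which the per-GPU frames are merged does not matter.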
Can I run FluidX3D on the CPU?

Yes, and this is especially useful when you need more memory than a GPU can offer. You only need to install the Intel CPU Runtime for OpenCL.
I'm on a budget and have only a cheap computer. Can I run FluidX3D on my toaster PC/laptop?

Absolutely. Today even the most inexpensive hardware, like integrated GPUs or entry-level gaming GPUs, supports OpenCL. You might be a bit more limited on memory capacity and grid resolution, but you should be good to go. I've tested FluidX3D on very old and inexpensive hardware and even on my Samsung S9+ smartphone, and it runs just fine, although admittedly a bit slower.
I don't have an expensive workstation GPU, but only a gaming GPU. Will performance suffer?

No. Efficiency on gaming GPUs is exactly as good as on their "professional"/workstation counterparts. Performance often is even better as gaming GPUs have higher boost clocks.
Do I need a GPU with ECC memory?

No. Gaming GPUs work just fine. Some Nvidia GPUs automatically reduce memory clocks for compute applications to almost entirely eliminate memory errors.
My GPU does not support CUDA. Can I still use FluidX3D?

Yes. FluidX3D uses OpenCL and not CUDA, so it runs on any GPU from any vendor since around 2009.
I don't have a dedicated graphics card at all. Can I still run FluidX3D on my PC/laptop?

Yes. FluidX3D also runs on all integrated GPUs since around 2012, and also on CPUs.
In the benchmarks you list some very expensive hardware. How do you get access to that?

As a PhD candidate in computational physics, I used FluidX3D for my research, so I had access to BZHPC, SuperMUC-NG, JSC JURECA-DC, and Leonardo supercomputers.

Graphics

I don't have an RTX/DXR GPU that supports raytracing. Can I still use raytracing graphics in FluidX3D?

Yes, and at full performance. FluidX3D does not use a bounding volume hierarchy (BVH) to accelerate raytracing, but fast ray-grid traversal instead, implemented directly in OpenCL C. This is much faster than BVH for moving isosurfaces in the LBM grid (~N vs. ~N²+log(N) runtime; LBM itself is ~N³), and it does not require any dedicated raytracing hardware. Raytracing in FluidX3D runs on any GPU that supports OpenCL 1.2.
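
The ~N cost of ray-grid traversal can be seen in a small sketch. The code below is the classic incremental voxel traversal (in the style of Amanatides and Woo), used here as a stand-in; it is not the OpenCL C implementation in FluidX3D, and the function name is illustrative. It assumes unit-sized cells and a ray origin inside the grid.

```python
import math


def traverse_grid(origin, direction, grid_size):
    """Yield the cells a ray visits, in order, until it leaves the grid."""
    cell = [int(math.floor(o)) for o in origin]
    step, t_max, t_delta = [], [], []
    for o, d, c in zip(origin, direction, cell):
        if d > 0.0:
            step.append(1)
            t_max.append((c + 1 - o) / d)  # ray parameter at the next cell face
            t_delta.append(1.0 / d)  # ray parameter per crossed cell
        elif d < 0.0:
            step.append(-1)
            t_max.append((c - o) / d)
            t_delta.append(-1.0 / d)
        else:
            step.append(0)
            t_max.append(math.inf)  # this axis is never crossed
            t_delta.append(math.inf)
    while all(0 <= c < n for c, n in zip(cell, grid_size)):
        yield tuple(cell)
        axis = t_max.index(min(t_max))  # nearest cell face decides the next step
        if t_max[axis] == math.inf:
            break  # zero direction vector: the ray never leaves its cell
        cell[axis] += step[axis]
        t_max[axis] += t_delta[axis]


if __name__ == "__main__":
    for cell in traverse_grid((0.5, 0.25, 0.5), (1.0, 1.0, 0.0), (3, 3, 1)):
        print(cell)
```

Each step advances exactly one cell along one axis, so a ray through an N×N×N grid visits at most about 3N cells, with no acceleration structure to rebuild when the isosurface moves.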
I have a datacenter/mining GPU without any video output or graphics hardware. Can FluidX3D still render simulation results?

Yes. FluidX3D does all rendering (rasterization and raytracing) in OpenCL C, so no display output and no graphics features like OpenGL/Vulkan/DirectX are required. Rendering is just another form of compute after all. Rendered frames are passed to the CPU over PCIe and then the CPU can either draw them on screen through dedicated/integrated graphics or write them to the hard drive.
I'm running FluidX3D on a remote (super-)computer and only have an SSH terminal. Can I still use graphics somehow?

Yes, either directly as interactive ASCII graphics in the terminal or by storing rendered frames on the hard drive and then copying them over via `scp -r user@server.url:"~/path/to/images/folder" .`.
Graphics support on Apple macOS?

On macOS and Android, INTERACTIVE_GRAPHICS mode is not supported, as no X11 is available. You can still use INTERACTIVE_GRAPHICS_ASCII though, or render video to the hard drive with regular GRAPHICS mode.

Licensing

I want to learn about programming/software/physics/engineering. Can I use FluidX3D for free?

Yes. Anyone can use FluidX3D for free for public research, education or personal use. Use by scientists, students and hobbyists is free of charge and well encouraged.
I am a scientist/teacher with a paid position at a public institution. Can I use FluidX3D for my research/teaching?

Yes, you can use FluidX3D free of charge. This is considered research/education, not commercial use. To give credit, the references listed below should be cited. If you publish data/results generated by altered source versions, the altered source code must be published as well.
I work at a company in CFD/consulting/R&D or related fields. Can I use FluidX3D commercially?

No. Commercial use is not allowed with the current license.
Is FluidX3D open-source?

No. "Open-source" as a technical term is defined as freely available without any restriction on use, but I am not comfortable with that. I have written FluidX3D in my spare time and no one should milk it for profits while I remain uncompensated, especially considering what other CFD software sells for. The technical term for the type of license I chose is "source-available no-cost non-commercial". The source code is freely available, and you are free to use, to alter and to redistribute it, as long as you do not sell it or make a profit from derived products/services, and as long as you do not use it for any military purposes (see the license for details).
Will FluidX3D at some point be available with a commercial license?

Maybe I will add the option for a second, commercial license later on. If you are interested in commercial use, let me know. For non-commercial use in science and education, FluidX3D is and will always be free.

External Code/Libraries/Images used in FluidX3D

OpenCL-Headers and C++ Wrapper for GPU parallelization (Khronos Group)
Win32 API for interactive graphics in Windows (Microsoft)
X11/Xlib for interactive graphics in Linux (The Open Group)
marching-cubes tables for isosurface generation on GPU (Paul Bourke)
src/lodepng.cpp and src/lodepng.hpp for .png encoding and decoding (Lode Vandevenne)
SimplexNoise class in src/utilities.hpp for generating continuous noise in 2D/3D/4D space (Stefan Gustavson)
skybox/skybox8k.png for free surface raytracing (HDRI Hub)

References

Lehmann, M.: Computational study of microplastic transport at the water-air interface with a memory-optimized lattice Boltzmann method. PhD thesis, (2023)
Lehmann, M.: Esoteric Pull and Esoteric Push: Two Simple In-Place Streaming Schemes for the Lattice Boltzmann Method on GPUs. Computation, 10, 92, (2022)
Lehmann, M., Krause, M., Amati, G., Sega, M., Harting, J. and Gekle, S.: Accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit, and customized 16-bit number formats. Phys. Rev. E 106, 015308, (2022)
Lehmann, M.: Combined scientific CFD simulation and interactive raytracing with OpenCL. IWOCL'22: International Workshop on OpenCL, 3, 1-2, (2022)
Lehmann, M., Oehlschlägel, L.M., Häusl, F., Held, A. and Gekle, S.: Ejection of marine microplastics by raindrops: a computational and experimental study. Micropl.&Nanopl. 1, 18, (2021)
Lehmann, M.: High Performance Free Surface LBM on GPUs. Master's thesis, (2019)
Lehmann, M. and Gekle, S.: Analytic Solution to the Piecewise Linear Interface Construction Problem and Its Application in Curvature Calculation for Volume-of-Fluid Simulation Codes. Computation, 10, 21, (2022)

Contact

FluidX3D is solo-developed and maintained by Dr. Moritz Lehmann.
For any questions, feedback or other inquiries, contact me at dr.moritz.lehmann@gmail.com.
Updates are posted on Mastodon via @ProjectPhysX/#FluidX3D and on YouTube.

Support

I'm developing FluidX3D in my spare time, to make computational fluid dynamics lightning fast, accessible on all hardware, and free for everyone.

You can support FluidX3D by reporting any bugs or things that don't work in the issues. I welcome feedback!
If you like FluidX3D, share it with friends and colleagues. Spread the word that CFD is now lightning fast, accessible and free.
If you want to support FluidX3D financially, you can sponsor me on GitHub or buy me a coffee. Thank you!
