Symbolically generated GPU-based LBM

Experimental generation of OpenCL kernels using SymPy, Mako and PyOpenCL.

This code was originally written for a seminar report on the advantages of symbolic generation of LBM implementations.

The source is available on GitHub and cgit, alongside detailed benchmarks on both an older Nvidia K2200 and a more recent Nvidia Tesla P100. Overall, the results of the CSE-optimized versions (CSE: common subexpression elimination) are surprisingly good, reaching up to 99.8% of the theoretical upper performance limit (D3Q27 in double precision on the P100).

Performance

Theoretical maximum performance on tested hardware:

GPU    Bandwidth      D2Q9            D3Q19           D3Q27
                      single  double  single  double  single  double
K2200   63.2 GiB/s       893     459     435     220     308     156
P100   512.6 GiB/s      7242    3719    3528    1787    2502    1262
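
The figures in this table are consistent with a simple memory-bandwidth bound. The sketch below reproduces them under assumptions that are not stated in the text: the values are taken to be million lattice updates per second (MLUPS), each cell update is assumed to read and write all q populations once, and 4 extra bytes per cell are assumed for an additional per-cell value (for example a material flag). The name `max_mlups` and its parameters are hypothetical and not taken from the repository.

```python
GIB = 2**30  # bytes per GiB


def max_mlups(bandwidth_gib_s, q, float_bytes, extra_bytes=4):
    """Bandwidth-limited upper bound in million lattice updates per second.

    Assumed traffic per cell: q populations read + q populations written,
    plus `extra_bytes` for one additional per-cell value (an assumption).
    """
    bytes_per_cell = 2 * q * float_bytes + extra_bytes
    return bandwidth_gib_s * GIB / bytes_per_cell / 1e6


# Bandwidths as listed in the table above.
bounds = {}
for gpu, bandwidth in (("K2200", 63.2), ("P100", 512.6)):
    bounds[gpu] = [
        round(max_mlups(bandwidth, q, size))
        for q in (9, 19, 27)   # D2Q9, D3Q19, D3Q27
        for size in (4, 8)     # single, double precision
    ]
    print(gpu, bounds[gpu])
# K2200 [893, 459, 435, 220, 308, 156]
# P100 [7242, 3719, 3528, 1787, 2502, 1262]
```

Under these assumptions the rounded results match every entry of the table, which suggests that the bound is purely memory-bound and ignores arithmetic cost.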

Maximum measured performance…

GPU    D2Q9            D3Q19           D3Q27
       single  double  single  double  single  double
K2200   843.4   326.4   423.2   163.8   303.0   116.0
P100   6957.4  3585.0  3420.2  1763.8  2374.6  1259.6

…relative to theoretical maximum

GPU    D2Q9            D3Q19           D3Q27
       single  double  single  double  single  double
K2200   94.4%   71.1%   97.3%   74.5%   98.4%   74.4%
P100    96.1%   96.4%   96.9%   98.7%   94.9%   99.8%
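
As a cross-check, the relative values can be recomputed by dividing each measured maximum by the corresponding theoretical maximum. This is a minimal sketch using only the numbers from the tables above; the variable names are illustrative and not taken from the repository.

```python
# Column order: D2Q9 single/double, D3Q19 single/double, D3Q27 single/double.
theoretical = {
    "K2200": [893, 459, 435, 220, 308, 156],
    "P100": [7242, 3719, 3528, 1787, 2502, 1262],
}
measured = {
    "K2200": [843.4, 326.4, 423.2, 163.8, 303.0, 116.0],
    "P100": [6957.4, 3585.0, 3420.2, 1763.8, 2374.6, 1259.6],
}

relative = {
    gpu: [round(100 * m / t, 1) for m, t in zip(measured[gpu], theoretical[gpu])]
    for gpu in theoretical
}

for gpu, row in relative.items():
    print(gpu, " ".join(f"{value}%" for value in row))
# K2200 94.4% 71.1% 97.3% 74.5% 98.4% 74.4%
# P100 96.1% 96.4% 96.9% 98.7% 94.9% 99.8%
```

The recomputed percentages agree with the table. On the K2200 the double precision results stay between roughly 71% and 75% of the bound, while on the P100 both precisions come within a few percent of it.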

For more details see the result/ and notebook/ directories.