GPU calculations peculiarities¶
LCODE 3D performs most of the calculations on GPU using a mix of two approaches.
CUDA kernels with numba.cuda¶
One can use CUDA from Python more or less directly by writing and launching CUDA kernels with numba.cuda. An example would be:
@numba.cuda.jit
def add_two_arrays_kernel(arr1, arr2, result):
    i = numba.cuda.grid(1)
    if i >= arr1.shape[0]:
        return
    result[i] = arr1[i] + arr2[i]
This function represents a loop body, launched in parallel with many threads at once.
Each of them starts with obtaining the array index it is ‘responsible’ for with cuda.grid(1)
and then proceeds to perform the required calculation.
As it is optimal to launch them in 32-threaded ‘warps’, one also has to handle the case
of having more threads than needed by making them skip the calculation.
No fancy Python operations are supported inside CUDA kernels; it is basically a way to write C-like bodies for hot loops without having to write actual C/CUDA code. You can only use simple types for kernel arguments and you cannot return anything from them.
To rub it in, this isn’t even a directly callable function yet. To conceal the limitations and the calling complexity, it is convenient to write a wrapper for it.
def add_two_arrays(arr1, arr2):
    result = cp.zeros_like(arr1)  # uses cupy, see below
    warp_count = int(ceil(arr1.size / WARP_SIZE))
    add_two_arrays_kernel[warp_count, WARP_SIZE](arr1, arr2, result)
    return result
A pair of numbers (warp_count, WARP_SIZE) is required to launch the kernel. warp_count is chosen this way so that warp_count * WARP_SIZE is at least as large as the problem size.
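The rounding can be sketched with plain Python (a standalone illustration; launch_size is a hypothetical helper, not an LCODE function):

```python
from math import ceil

WARP_SIZE = 32  # matches lcode.WARP_SIZE, see below

def launch_size(problem_size):
    # Hypothetical helper: round the problem size up to whole warps.
    warp_count = int(ceil(problem_size / WARP_SIZE))
    return warp_count, WARP_SIZE

# 100 elements need 4 warps: 4 * 32 = 128 >= 100, and the kernel's
# bounds check makes the extra 28 threads skip the calculation.
```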
- lcode.WARP_SIZE = 32¶
  Should be detectable with newer cupy (>6.0.0b2) as WARP_SIZE = cp.cuda.Device(config.gpu_index).attributes['WarpSize']. As of 2019 it’s equal to 32 for all CUDA-capable GPUs. It’s even a hardcoded value in cupy.
Array-wise operations with cupy¶
cupy is a GPU array library that aims to implement a numpy-like interface to GPU arrays. It allows one to, e.g., add up two GPU arrays with a simple and terse a + b. Most of the functions in LCODE use vectorized operations and cupy. All memory management is done with cupy for consistency.
It’s hard to overestimate the convenience of this approach, but sometimes expressing algorithms in vectorized notation is too hard or suboptimal.
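Since cupy mirrors the numpy API, the convenience can be sketched with numpy alone (the xp alias is our own convention here; 'import cupy as xp' would run the same code on a GPU):

```python
import numpy as xp  # 'import cupy as xp' moves the same code to the GPU

a = xp.arange(5.0)  # [0., 1., 2., 3., 4.]
b = xp.ones(5)

# One terse vectorized expression...
c = a + b

# ...replaces the explicit loop it is equivalent to:
d = xp.empty_like(a)
for i in range(a.size):
    d[i] = a[i] + b[i]
```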
The only two places where we actually resort to writing CUDA kernels are deposit() (our fine particle loop) and move_smart() (our coarse particle loop).
Copying is expensive¶
If the arrays were copied between GPU RAM and host RAM, the PCI-E bandwidth would become a bottleneck. The two most useful strategies to minimize excessive copying are

- churning for several consecutive \(\xi\)-steps with no copying and no CPU-side data processing (with a notable exception of beam() and the resulting beam_ro); and
- copying only the subset of the arrays that the outer diagnostics code needs.
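Both strategies can be sketched in plain Python (the names, the step function, and the 'Ez' field are hypothetical stand-ins; list() plays the role of gpu_array.get(), i.e. a GPU-to-host copy):

```python
def advance_one_xi_step(state):
    # Stand-in for a real xi-step that keeps all data on the GPU.
    return state

def diagnostics_subset(state, needed):
    # Copy to host only the arrays the outer diagnostics code needs;
    # list() stands in for gpu_array.get().
    return {name: list(state[name]) for name in needed}

def churn(state, xi_steps, diagnostics_each=10, needed=('Ez',)):
    # Churn through consecutive xi-steps with no copying in between,
    # copying a subset out only when diagnostics fire.
    snapshots = []
    for step in range(1, xi_steps + 1):
        state = advance_one_xi_step(state)
        if step % diagnostics_each == 0:
            snapshots.append(diagnostics_subset(state, needed))
    return state, snapshots
```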
GPU array conversion¶
In order for a + b to work in cupy, both arrays have to be copied to the GPU (cupy.asarray(a)) and, in case you want the results back as numpy arrays, you have to explicitly copy them back (gpu_array.get()). While for LCODE 3D itself it’s easier and quicker to stick to using GPU arrays exclusively, this means the only time we want to do the conversion to numpy is when we are returning the results back to the external code.
There are two classes that assist in copying the arrays back and forth as conveniently as possible. The implementation looks a bit nightmarish, but using them is simple.
- class lcode.GPUArrays(**kwargs)[source]¶
  A convenient way to group several GPU arrays and access them with a dot.
  x = GPUArrays(something=numpy_array, something_else=another_array) will create x with x.something and x.something_else stored on GPU.
  Do not add more attributes later, specify them all at construction time.
- class lcode.GPUArraysView(gpu_arrays)[source]¶
  This is a magical wrapper around GPUArrays that handles GPU-RAM data transfer transparently. Accessing view.something will automatically copy the array to host RAM; setting view.something = ... will copy the changes back to GPU RAM.
  Usage: view = GPUArraysView(gpu_arrays); view.something
  Do not add more attributes later, specify them all at construction time.
  NOTE: repeatedly accessing an attribute will result in repeated copying!
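The attribute-interception mechanism can be sketched in a few lines (a simplified CPU stand-in, not the actual lcode implementation; the real class copies with cupy.asarray() and gpu_array.get()):

```python
from types import SimpleNamespace

class ArraysViewSketch:
    """Simplified CPU stand-in for the GPUArraysView mechanism."""
    def __init__(self, wrapped):
        # Bypass our own __setattr__ when storing the wrapped object.
        super().__setattr__('_wrapped', wrapped)

    def __getattr__(self, name):
        # Called for attributes not found on the view itself;
        # list() stands in for the GPU-to-host copy (gpu_array.get()).
        return list(getattr(self._wrapped, name))

    def __setattr__(self, name, value):
        # list() stands in for the host-to-GPU copy (cupy.asarray()).
        setattr(self._wrapped, name, list(value))

gpu = SimpleNamespace(x_init=[1, 2, 3])  # pretend these live on the GPU
view = ArraysViewSketch(gpu)
a = view.x_init    # 'copies' the array to the host
b = view.x_init    # accessing again makes another fresh copy: a is not b
view.x_init = [9]  # 'copies' the change back
```

This also makes the repeated-copying caveat concrete: every attribute access goes through __getattr__ and produces a new host-side array.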
This way we can wrap everything we need in GPUArrays with, e.g., const = GPUArrays(x_init=x_init, y_init=y_init, ...) and then access them as const.x_init from GPU-heavy code. For the outer code that does not care about GPU arrays at all, we can return a wrapped const_view = GPUArraysView(const) and access the arrays as const_view.x_init.
Copying will happen on-demand during the attribute access, intercepted by our __getattr__ implementation, but beware!
Note
Repeatedly accessing const_view.x_init will needlessly perform the copying again, so one should bind it to the variable name (x_init = const_view.x_init) once and reuse the resulting numpy array.
Todo
CODE: wrap the returned array with GPUArraysView by default
Selecting GPU¶
- config_example.gpu_index = 0¶
  Index of the GPU that should perform the calculations.
LCODE 3D currently does not support utilizing several GPUs for one simulation, but once we switch to beam evolution calculation, processing several consecutive \(t\)-steps in a pipeline of several GPUs should be low-hanging fruit.