Most current accelerated computing nodes are based on a host-device architecture, where the host is a general-purpose processor to which one or more devices are attached in order to offload some operations. On classical, non-accelerated nodes, only the host is present.
The most common devices are GPGPUs (General-Purpose Graphics Processing Units), but other types of accelerators, such as NEC's Vector Engine or FPGAs (Field Programmable Gate Arrays), could also be considered.
Though accelerators can provide greater performance and energy efficiency, leveraging their power cannot be done without adaptations in the programming model.
Accelerators usually have dedicated memory, with high bandwidth, which is separate from the host memory. Copying between host and device memory incurs latency, and must be done as infrequently as possible. Mainstream accelerator programming models may provide both separate host and device memory, with explicit exchanges, and "unified shared memory", allowing memory to be accessed from both the host and devices in an almost transparent manner, so as to improve programmability and maintainability. The underlying implementation is usually based on a paging mechanism, which may incur additional latency if not sufficiently well understood and used (so the associated programming models may provide functions for prefetching or for providing hints as to how the memory is actually used).
Available memory on devices is often more limited than that on the host, so allocating everything explicitly on the device could cause us to run out of available memory. Using unified shared memory can avoid this, as memory paging may provide the "illusion" of having more memory on the device, though performance can degrade seriously when this mechanism kicks in.
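As an illustration of the unified shared memory and prefetching mechanisms mentioned above, here is a minimal CUDA sketch (not taken from code_saturne; the kernel, array size, and block size are placeholders), in which a managed allocation is prefetched to the device before a kernel runs, and back to the host before the results are read there:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Placeholder kernel; stands in for any computational kernel.
__global__ void scale(double *x, std::size_t n, double a) {
  std::size_t i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

int main(void) {
  const std::size_t n = 1 << 20;
  double *x = nullptr;
  int device_id = 0;
  cudaGetDevice(&device_id);

  // Unified ("managed") allocation: accessible from host and device,
  // migrated on demand by the paging mechanism.
  cudaMallocManaged(&x, n*sizeof(double));

  for (std::size_t i = 0; i < n; i++)  // initialize on the host
    x[i] = 1.0;

  // Prefetch to the device before the kernel to avoid page faults
  // during the computation.
  cudaMemPrefetchAsync(x, n*sizeof(double), device_id, 0);

  scale<<<(n + 255)/256, 256>>>(x, n, 2.0);

  // Prefetch back to the host before reading the results there.
  cudaMemPrefetchAsync(x, n*sizeof(double), cudaCpuDeviceId, 0);
  cudaDeviceSynchronize();

  cudaFree(x);
  return 0;
}
```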
Ideally, we could use unified shared memory in all cases, and while this might be done in the future, it seems safer for the present to give the developer control over which type of memory is used. So in code_saturne, when allocating memory which might be needed on an accelerator, the CS_MALLOC_HD macro should be used, specifying the allocation type with a cs_alloc_mode_t argument. In a manner similar to the older BFT_MALLOC macro, this provides some instrumentation, and serves as a portability layer between several programming models.
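As a sketch of the intended usage pattern (assuming CS_MALLOC_HD follows the same (pointer, count, type) convention as BFT_MALLOC with an additional cs_alloc_mode_t argument, and that CS_FREE_HD is the matching free macro; the exact signatures, header location, and available allocation modes should be checked in the source):

```cpp
#include "cs_defs.h"   /* cs_lnum_t, cs_real_t */
#include "cs_mem.h"    /* assumed location of CS_MALLOC_HD / CS_FREE_HD;
                          may be cs_base_accel.h in older versions */

void
example_allocation(cs_lnum_t  n_cells)
{
  cs_real_t *field = NULL;

  /* Allocation usable on both host and device; the allocation mode
     (host only, host + device with explicit exchanges, shared/unified, ...)
     is selected through the cs_alloc_mode_t argument. */
  CS_MALLOC_HD(field, n_cells, cs_real_t, CS_ALLOC_HOST_DEVICE_SHARED);

  for (cs_lnum_t i = 0; i < n_cells; i++)
    field[i] = 0.;

  /* ... use field in host or device code ... */

  CS_FREE_HD(field);
}
```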
As classical C and current C++ or Fortran standards cannot express all the possible parallelism, programming for accelerators may be done using either:
Note that the mainstream language extensions listed above, as well as Kokkos, and many DSLs are all based on C++.
None of these approaches is currently as ubiquitous or portable as the C and Fortran basis with host-based OpenMP directives on which most of code_saturne is built:
Given these constraints, the current strategy regarding accelerator support is the following:
Host-level parallelism is currently based on OpenMP constructs. Parallel loops using threads are used where possible. Though this is not used yet, OpenMP tasks could also be used to benefit from additional parallelization opportunities.
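For reference, host-level thread parallelism follows the usual OpenMP pattern, sketched below on a hypothetical cell loop (the function name is illustrative; CS_THR_MIN is the threading threshold used elsewhere in the code):

```cpp
#include "cs_defs.h"  /* cs_lnum_t, cs_real_t, CS_THR_MIN */

/* Hypothetical cell loop, parallelized with host OpenMP threads. */

void
axpy_cells(cs_lnum_t         n_cells,
           cs_real_t         a,
           const cs_real_t  *x,
           cs_real_t        *y)
{
# pragma omp parallel for  if (n_cells > CS_THR_MIN)
  for (cs_lnum_t i = 0; i < n_cells; i++)
    y[i] += a*x[i];
}
```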
Vectorization may also be used locally to enhance performance, whether handled automatically by the compiler or explicitly through directives. In practice, as code_saturne is mostly memory-bound, the benefits of vectorization are limited, so improving the vectorization of various algorithms is not a priority.
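Where explicit directives are used, they typically take a form similar to the following sketch (illustrative only; the loop is hypothetical):

```cpp
#include "cs_defs.h"  /* cs_lnum_t, cs_real_t */

/* Explicit vectorization hint on a simple loop; a compiler may already
   vectorize such a loop automatically. */

void
scale_add(cs_lnum_t         n,
          cs_real_t         a,
          const cs_real_t  *x,
          cs_real_t        *y)
{
# pragma omp simd
  for (cs_lnum_t i = 0; i < n; i++)
    y[i] += a*x[i];
}
```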
Various devices may be considered, but the main targets are currently GPGPUs.
As mentioned above, exploiting parallelism can be based on CUDA, DPC++, or OpenMP directives.
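As a simple illustration of the directive-based option, an OpenMP target offload version of a loop could look like the following sketch (the loop is hypothetical, not taken from code_saturne):

```cpp
/* OpenMP target offload: the loop body runs on the device, with data
   mapped to and from device memory around the offloaded region. */

void
scale_on_device(long     n,
                double   a,
                double  *x)
{
# pragma omp target teams distribute parallel for map(tofrom: x[0:n])
  for (long i = 0; i < n; i++)
    x[i] *= a;
}
```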
Note that parallelism on GPGPUs is usually based on massive multi-threading, where operations on an array may usually be divided into a series of chunks (blocks), where each block is scheduled to run on available processors (ideally in an unspecified order), and computation of a given block is itself multi-threaded.
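The following minimal CUDA sketch illustrates this decomposition: the array is covered by a grid of blocks, each containing threads which each handle one element (names and block size are illustrative):

```cpp
#include <cuda_runtime.h>

/* Each thread handles one array element; the grid of blocks covers the
   whole array, and blocks may be scheduled in any order. */

__global__ void
add_kernel(int n, const double *x, double *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] += x[i];
}

void
add_on_device(int n, const double *d_x, double *d_y)
{
  const int block_size = 256;                          /* threads per block */
  const int n_blocks = (n + block_size - 1) / block_size;

  add_kernel<<<n_blocks, block_size>>>(n, d_x, d_y);
}
```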
Computational kernels launched on a device from the host are usually at least partially asynchronous with respect to the host (at least with CUDA and oneAPI/DPC++), so that parallelism between the host and device may be exploited when the algorithm allows it.
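The following CUDA sketch shows this kind of overlap: the kernel launch returns immediately, so independent host work can proceed before synchronizing with the device (the kernel and host work are placeholders):

```cpp
#include <cuda_runtime.h>

/* Placeholder device kernel: stands in for any computational kernel. */
__global__ void
device_work(int n, double *x)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    x[i] *= 2.0;
}

/* Placeholder for independent host-side work. */
static void
host_work(void)
{
}

void
overlap_example(int n, double *d_x)
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  /* The kernel launch is asynchronous: control returns to the host
     immediately after the launch is queued. */
  device_work<<<(n + 255)/256, 256, 0, stream>>>(n, d_x);

  /* Independent host computation overlaps with the device kernel. */
  host_work();

  /* Wait for the device before using its results on the host. */
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
}
```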