Context:
An HPC user is developing a CUDA kernel. The goal is to extract features (a histogram) from a data volume (for example, 1024x1024x1024 float numbers, i.e., 4 GiB).
Challenge:
How can a CUDA code be accelerated by using multiple GPUs at the same time?
Mono GPU version:
- Get the data samples in the CPU memory (the "host").
- Send the data to the GPU (the "device").
- The GPU kernel computes the histogram.
- The histogram is sent back from the GPU to the CPU.
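The compute step above can be sketched as a simple CUDA kernel. This is a minimal illustration, not the code from histogram.cu: the kernel name, the `numBins` bin count, the value range `minVal`/`maxVal`, and the `int` counter type are all assumptions.

```cuda
// Hypothetical histogram kernel: each thread strides over the data and
// bins values into [minVal, maxVal); counts use atomicAdd because many
// threads may hit the same bin concurrently.
__global__ void histogramKernel(const float *data, size_t n,
                                int *histogram, int numBins,
                                float minVal, float maxVal)
{
    size_t idx    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = idx; i < n; i += stride) {
        int bin = (int)((data[i] - minVal) / (maxVal - minVal) * numBins);
        if (bin < 0) bin = 0;                 // clamp out-of-range values
        if (bin >= numBins) bin = numBins - 1;
        atomicAdd(&histogram[bin], 1);        // serialize concurrent updates
    }
}
```

A common refinement is to accumulate into a per-block histogram in shared memory first, then merge into global memory once per block, which reduces contention on `atomicAdd`.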
n GPUs version:
- Get the data samples in the CPU memory.
- Slice the data into n chunks, and send one chunk to each GPU.
- Each of the n GPU kernels computes its local histogram.
- The n GPUs communicate to compute the global histogram (an all-reduce operation).
- The histogram is sent back to the CPU from any single GPU (e.g., GPU 0, chosen arbitrarily).
Solution (you need to be a UL HPC member):
https://github.com/ULHPC/GPUguru/blob/main/histogram.cu
First, you need the NCCL library.
Include the NCCL header in the code:
#include <nccl.h>
Initializing NCCL:
const int nDev = 2;                // change the value here
int devs[nDev];
for (int i = 0; i < nDev; i++) devs[i] = i;
ncclComm_t comm[nDev];
ncclCommInitAll(comm, nDev, devs); // using GPU devices 0 and 1
Selecting one GPU, with "i" the GPU identifier:
cudaSetDevice(i);
Combining all the per-GPU arrays (sum) across all GPUs, in place:
ncclAllReduce(d_histogram[i], d_histogram[i], numBins, ncclInt, ncclSum, comm[i], NULL);
Note: when a single thread drives several communicators, the per-GPU ncclAllReduce calls must be enclosed between ncclGroupStart() and ncclGroupEnd(), otherwise the calls can deadlock.
Destroying the NCCL communicators:
for (int i = 0; i < nDev; i++) ncclCommDestroy(comm[i]);
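Putting the steps together, here is a hedged end-to-end sketch of the n-GPU flow. It is not the actual histogram.cu: `histogramKernel`, `numBins`, `minVal`/`maxVal`, `h_data`/`h_histogram`, and the even chunk split are illustrative assumptions.

```cuda
// Assumed sizes: N floats split evenly across nDev GPUs.
const int    numBins   = 256;
const size_t N         = 1024ULL * 1024 * 1024;
const size_t chunkSize = N / nDev;           // assumes N divisible by nDev
float *d_data[nDev];
int   *d_histogram[nDev];

// Per-GPU: allocate, upload one chunk, launch the local histogram kernel.
for (int i = 0; i < nDev; i++) {
    cudaSetDevice(i);
    cudaMalloc(&d_data[i], chunkSize * sizeof(float));
    cudaMalloc(&d_histogram[i], numBins * sizeof(int));
    cudaMemset(d_histogram[i], 0, numBins * sizeof(int));
    cudaMemcpy(d_data[i], h_data + i * chunkSize,
               chunkSize * sizeof(float), cudaMemcpyHostToDevice);
    histogramKernel<<<256, 256>>>(d_data[i], chunkSize,
                                  d_histogram[i], numBins, minVal, maxVal);
}

// In-place sum: after this, every GPU holds the global histogram.
// Group calls are required when one thread drives several communicators.
ncclGroupStart();
for (int i = 0; i < nDev; i++) {
    ncclAllReduce(d_histogram[i], d_histogram[i], numBins,
                  ncclInt, ncclSum, comm[i], NULL);
}
ncclGroupEnd();

// Copy the global result back from GPU 0 (arbitrary choice).
cudaSetDevice(0);
cudaDeviceSynchronize();
cudaMemcpy(h_histogram, d_histogram[0], numBins * sizeof(int),
           cudaMemcpyDeviceToHost);
```

Each GPU only ever holds its own chunk (`chunkSize` floats) plus one histogram, which is why the per-GPU memory footprint shrinks as nDev grows.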
Conclusion:
With 2 GPUs the code runs about 2x faster, and in addition the memory consumption on each GPU is divided by 2, since each device only holds half of the data volume.