Using multiple GPUs with CUDA

Context:
An HPC user is developing a CUDA kernel. The goal is to extract features (a histogram) from a data volume (for example, 1024x1024x1024 float values).

Challenge:
How can a CUDA code be accelerated by using multiple GPUs at the same time?

Single-GPU version:

  • Get data samples in the CPU memory (the “host”)
  • Send data to the GPU (the “device”)
  • The GPU kernel computes the histogram
  • The histogram is sent back from the GPU to the CPU (a minimal sketch of this pipeline follows the list).
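
A minimal sketch of this single-GPU pipeline is given below. The kernel name histogramKernel, the bin count numBins, and the host buffers h_data and h_histogram are illustrative assumptions, not names taken from the UL HPC code:

// Illustrative kernel: each thread bins one sample with an atomic add.
__global__ void histogramKernel(const float *data, size_t n, int *bins, int numBins) {
  size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    int bin = (int)(data[idx] * numBins);   // assumes samples normalized to [0,1)
    if (bin < 0) bin = 0;
    if (bin > numBins - 1) bin = numBins - 1;
    atomicAdd(&bins[bin], 1);
  }
}

// Host side: allocate, copy in, compute, copy out.
// h_data and h_histogram are assumed host buffers.
const int numBins = 256;                     // illustrative bin count
size_t n = 1024ULL * 1024 * 1024;            // 1024^3 samples (4 GiB of float data)
float *d_data;      cudaMalloc(&d_data, n * sizeof(float));
int   *d_histogram; cudaMalloc(&d_histogram, numBins * sizeof(int));
cudaMemset(d_histogram, 0, numBins * sizeof(int));
cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
unsigned int blocks = (unsigned int)((n + 255) / 256);
histogramKernel<<<blocks, 256>>>(d_data, n, d_histogram, numBins);
cudaMemcpy(h_histogram, d_histogram, numBins * sizeof(int), cudaMemcpyDeviceToHost);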

n-GPU version:

  • Get data samples in the CPU memory.
  • Slice the data into n chunks and send one chunk to each GPU (see the slicing sketch after this list).
  • Each of the n GPU kernels computes its local histogram.
  • The n GPUs communicate to compute the global histogram (an all-reduce operation).
  • The histogram is sent back from any one GPU (e.g., GPU 0, chosen arbitrarily) to the CPU.
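
The slicing step itself is simple pointer arithmetic: GPU i receives a contiguous chunk of the volume. A hedged sketch is shown below; nDev is the number of GPUs (defined in the NCCL initialization snippet further down), and d_data, d_histogram, count and h_data are illustrative names:

float  *d_data[nDev];                        // per-GPU input chunk
int    *d_histogram[nDev];                   // per-GPU local histogram
size_t  count[nDev];                         // number of samples held by each GPU
size_t  n = 1024ULL * 1024 * 1024;           // total number of samples
size_t  chunk = (n + nDev - 1) / nDev;       // samples per GPU (last chunk may be shorter)
for (int i = 0; i < nDev; i++) {
  size_t offset = (size_t)i * chunk;
  count[i] = (offset + chunk <= n) ? chunk : n - offset;
  cudaSetDevice(i);                          // the following calls target GPU i
  cudaMalloc(&d_data[i], count[i] * sizeof(float));
  cudaMalloc(&d_histogram[i], numBins * sizeof(int));
  cudaMemcpy(d_data[i], h_data + offset, count[i] * sizeof(float), cudaMemcpyHostToDevice);
}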

Solution (you need to be a UL HPC member to access the repository):

https://github.com/ULHPC/GPUguru/blob/main/histogram.cu

First, you need the NCCL library (NVIDIA Collective Communications Library).

Including the NCCL header in the code:

#include <nccl.h>
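
The code then has to be compiled with nvcc and linked against NCCL. On a cluster, NCCL is typically made available through an environment module whose exact name varies, so only the link flag is shown here as a sketch:

nvcc histogram.cu -o histogram -lnccl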

Initializing NCCL:

const int nDev = 2;  // change the value here
int devs[nDev];
for (int i = 0; i < nDev; i++)
  devs[i] = i;
ncclComm_t comm[nDev];
ncclCommInitAll(comm, nDev, devs);  // Using GPU devices 0 and 1
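
In a real application each NCCL call should be checked for errors. A common pattern is a small checking macro; the name NCCLCHECK is just a convention, not part of NCCL:

#include <cstdio>
#include <cstdlib>

// Abort with a readable message if an NCCL call fails.
#define NCCLCHECK(cmd) do {                                  \
    ncclResult_t r = (cmd);                                  \
    if (r != ncclSuccess) {                                  \
      fprintf(stderr, "NCCL error %s:%d: %s\n",              \
              __FILE__, __LINE__, ncclGetErrorString(r));    \
      exit(EXIT_FAILURE);                                    \
    }                                                        \
  } while (0)

// Example usage:
NCCLCHECK(ncclCommInitAll(comm, nDev, devs));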

Selecting one GPU, where "i" is the GPU identifier:

cudaSetDevice(i);
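
In the multi-GPU version this call typically sits inside a loop over the devices, so that each GPU processes its own chunk. A sketch, reusing the illustrative histogramKernel, numBins, d_data, d_histogram and count from the earlier snippets, and creating one CUDA stream per GPU:

cudaStream_t stream[nDev];
for (int i = 0; i < nDev; i++) {
  cudaSetDevice(i);                                  // all following calls target GPU i
  cudaStreamCreate(&stream[i]);
  cudaMemsetAsync(d_histogram[i], 0, numBins * sizeof(int), stream[i]);
  unsigned int blocks = (unsigned int)((count[i] + 255) / 256);
  histogramKernel<<<blocks, 256, 0, stream[i]>>>(d_data[i], count[i], d_histogram[i], numBins);
}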

Combining (summing) the per-GPU histogram arrays across all GPUs:

ncclAllReduce(d_histogram[i], d_histogram[i], numBins, ncclInt, ncclSum, comm[i], NULL);
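
When a single host process drives several GPUs, the per-communicator calls are usually wrapped in ncclGroupStart()/ncclGroupEnd() so that NCCL can post them together without deadlocking. A sketch, assuming the per-device streams from the launch loop above and an assumed host buffer h_histogram:

ncclGroupStart();
for (int i = 0; i < nDev; i++)
  // In-place sum: after the all-reduce, every GPU holds the global histogram.
  ncclAllReduce(d_histogram[i], d_histogram[i], numBins, ncclInt, ncclSum,
                comm[i], stream[i]);
ncclGroupEnd();

// Wait for the reduction to finish on every device,
// then copy the result back from GPU 0 (any GPU would do).
for (int i = 0; i < nDev; i++) {
  cudaSetDevice(i);
  cudaStreamSynchronize(stream[i]);
}
cudaSetDevice(0);
cudaMemcpy(h_histogram, d_histogram[0], numBins * sizeof(int), cudaMemcpyDeviceToHost);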

Destroying the NCCL communicators:

for (int i = 0; i < nDev; i++)
  ncclCommDestroy(comm[i]);
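
The per-GPU device buffers and streams created in the earlier sketches should be released as well; a short cleanup sketch using those illustrative names:

for (int i = 0; i < nDev; i++) {
  cudaSetDevice(i);
  cudaFree(d_data[i]);
  cudaFree(d_histogram[i]);
  cudaStreamDestroy(stream[i]);
}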

Conclusion:

The code runs about 2x faster with 2 GPUs and, in addition, the memory consumption on each GPU is divided by 2, since each GPU only holds half of the data volume.