Context:
An HPC user is developing a CUDA kernel. The goal is to extract features (a histogram) from a data volume (for example, 1024x1024x1024 float numbers, i.e., 4 GiB).
Challenge:
How can a CUDA code be accelerated by using multiple GPUs at the same time?
Mono GPU version:
- Get the data samples in the CPU memory (the "host").
- Send the data to the GPU (the "device").
- The GPU kernel computes the histogram.
- The histogram is sent back from the GPU to the CPU.
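The compute step above can be sketched as a simple CUDA kernel. This is a minimal illustration, not the code from histogram.cu: the kernel name, the `numBins` bin count, the value range `minVal`/`maxVal`, and the `int` counter type are all assumptions.

```cuda
// Hypothetical histogram kernel: each thread strides over the data and
// bins values into [minVal, maxVal); counts use atomicAdd because many
// threads may hit the same bin concurrently.
__global__ void histogramKernel(const float *data, size_t n,
                                int *histogram, int numBins,
                                float minVal, float maxVal)
{
    size_t idx    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = idx; i < n; i += stride) {
        int bin = (int)((data[i] - minVal) / (maxVal - minVal) * numBins);
        if (bin < 0) bin = 0;                 // clamp out-of-range values
        if (bin >= numBins) bin = numBins - 1;
        atomicAdd(&histogram[bin], 1);        // serialize concurrent updates
    }
}
```

A common refinement is to accumulate into a per-block histogram in shared memory first, then merge into global memory once per block, which reduces contention on `atomicAdd`.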
n GPUs version:
- Get the data samples in the CPU memory.
- Slice the data into n chunks, and send one chunk to each GPU.
- Each of the n GPU kernels computes its local histogram.
- The n GPUs communicate to compute the global histogram (an all-reduce operation).
- The histogram is sent back to the CPU from any single GPU (e.g., GPU 0, chosen arbitrarily).
Solution (you need to be a UL HPC member):
https://github.com/ULHPC/GPUguru/blob/main/histogram.cu
First, you need the NCCL library.
Include the NCCL header in the code:
#include <nccl.h>
Initializing NCCL:
const int nDev = 2;                // change the value here
int devs[nDev];
for (int i = 0; i < nDev; i++) devs[i] = i;
ncclComm_t comm[nDev];
ncclCommInitAll(comm, nDev, devs); // using GPU devices 0 and 1
Selecting one GPU, with "i" the GPU identifier:
cudaSetDevice(i);
Combining all the per-GPU arrays (sum) across all GPUs, in place:
ncclAllReduce(d_histogram[i], d_histogram[i], numBins, ncclInt, ncclSum, comm[i], NULL);
Note: when a single thread drives several communicators, the per-GPU ncclAllReduce calls must be enclosed between ncclGroupStart() and ncclGroupEnd(), otherwise the calls can deadlock.
Destroying the NCCL communicators:
for (int i = 0; i < nDev; i++) ncclCommDestroy(comm[i]);
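Putting the steps together, here is a hedged end-to-end sketch of the n-GPU flow. It is not the actual histogram.cu: `histogramKernel`, `numBins`, `minVal`/`maxVal`, `h_data`/`h_histogram`, and the even chunk split are illustrative assumptions.

```cuda
// Assumed sizes: N floats split evenly across nDev GPUs.
const int    numBins   = 256;
const size_t N         = 1024ULL * 1024 * 1024;
const size_t chunkSize = N / nDev;           // assumes N divisible by nDev
float *d_data[nDev];
int   *d_histogram[nDev];

// Per-GPU: allocate, upload one chunk, launch the local histogram kernel.
for (int i = 0; i < nDev; i++) {
    cudaSetDevice(i);
    cudaMalloc(&d_data[i], chunkSize * sizeof(float));
    cudaMalloc(&d_histogram[i], numBins * sizeof(int));
    cudaMemset(d_histogram[i], 0, numBins * sizeof(int));
    cudaMemcpy(d_data[i], h_data + i * chunkSize,
               chunkSize * sizeof(float), cudaMemcpyHostToDevice);
    histogramKernel<<<256, 256>>>(d_data[i], chunkSize,
                                  d_histogram[i], numBins, minVal, maxVal);
}

// In-place sum: after this, every GPU holds the global histogram.
// Group calls are required when one thread drives several communicators.
ncclGroupStart();
for (int i = 0; i < nDev; i++) {
    ncclAllReduce(d_histogram[i], d_histogram[i], numBins,
                  ncclInt, ncclSum, comm[i], NULL);
}
ncclGroupEnd();

// Copy the global result back from GPU 0 (arbitrary choice).
cudaSetDevice(0);
cudaDeviceSynchronize();
cudaMemcpy(h_histogram, d_histogram[0], numBins * sizeof(int),
           cudaMemcpyDeviceToHost);
```

Each GPU only ever holds its own chunk (`chunkSize` floats) plus one histogram, which is why the per-GPU memory footprint shrinks as nDev grows.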
Conclusion:
With 2 GPUs the code runs about 2x faster, and in addition the memory consumption on each GPU is divided by 2, since each device only holds half of the data volume.