Using multiple GPUs with CUDA

One HPC user is developing a CUDA kernel. The goal is to extract features (a histogram) from a data volume (for example, 1024x1024x1024 float values).

How can a CUDA code be accelerated by using multiple GPUs at the same time?

Mono GPU version:

  • Get data samples in the CPU memory (the “host”)
  • Send data to the GPU (the “device”)
  • The GPU kernel computes the histogram
  • The histogram is sent back from the GPU to the CPU
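
The GPU histogram step is typically implemented with atomic increments into a bin array. Below is a minimal kernel sketch, where the kernel name, the bin count, and the linear binning formula are illustrative assumptions:

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 256  // assumed number of histogram bins

// Each thread reads one sample and atomically increments the matching bin.
__global__ void histogramKernel(const float *data, size_t n, int *bins,
                                float minVal, float maxVal)
{
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (idx >= n) return;
    // Map the sample value to a bin index (simple linear binning).
    float norm = (data[idx] - minVal) / (maxVal - minVal);
    int bin = (int)(norm * NUM_BINS);
    if (bin < 0) bin = 0;                     // clamp below range
    if (bin >= NUM_BINS) bin = NUM_BINS - 1;  // clamp above range
    atomicAdd(&bins[bin], 1);
}
```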

n-GPU version:

  • Get data samples in the CPU memory.
  • Slice the data into n chunks, and send one chunk to each GPU.
  • The n GPU kernels compute their local histograms.
  • The n GPUs communicate to compute the global histogram (an all-reduce operation).
  • The histogram is sent back to the CPU from any GPU (e.g., arbitrarily chosen GPU 0).
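
The slicing step can be sketched on the host side in plain C++ (the function and variable names are assumptions). It computes an (offset, count) pair per GPU so that the chunks tile the whole volume even when its size is not divisible by n:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Split `total` elements into `n` near-equal chunks.
// Returns one (offset, count) pair per GPU.
std::vector<std::pair<std::size_t, std::size_t>> sliceVolume(std::size_t total, int n)
{
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    std::size_t base = total / n, rem = total % n, offset = 0;
    for (int i = 0; i < n; i++) {
        // Spread the remainder over the first `rem` chunks.
        std::size_t count = base + (i < (int)rem ? 1 : 0);
        chunks.push_back({offset, count});
        offset += count;
    }
    return chunks;
}
```

Each pair then drives one host-to-device copy of `count` floats starting at `offset` for GPU i.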

Solution (you need to be a UL HPC member):

First, you need the NCCL library.

Include the NCCL library in the code:

#include <nccl.h>

Initializing NCCL

const int nDev = 2;  // change the value here
int devs[nDev];
for (int i = 0; i < nDev; i++)
    devs[i] = i;                    // use GPU devices 0..nDev-1
ncclComm_t comm[nDev];
ncclCommInitAll(comm, nDev, devs);  // one communicator per GPU

Selecting one GPU, with "i" the GPU identifier:

cudaSetDevice(i);

Combining all arrays (sum) across all GPUs (when a single thread drives several devices, the per-device calls must be wrapped in a NCCL group):

ncclGroupStart();
for (int i = 0; i < nDev; i++)
    ncclAllReduce(d_histogram[i], d_histogram[i], numBins, ncclInt, ncclSum, comm[i], NULL);
ncclGroupEnd();
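
On the host, the effect of ncclAllReduce with ncclSum can be illustrated as an element-wise sum of the local histograms; after the call, every GPU holds the same global result. A plain C++ illustration (function name assumed):

```cpp
#include <cstddef>
#include <vector>

// Element-wise sum of per-GPU histograms: what ncclSum computes across devices.
std::vector<int> allReduceSum(const std::vector<std::vector<int>> &local)
{
    std::vector<int> global(local[0].size(), 0);
    for (const auto &h : local)
        for (std::size_t b = 0; b < h.size(); b++)
            global[b] += h[b];
    return global;  // after the all-reduce, each GPU's buffer equals this
}
```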

Destroying the NCCL communicators:

for (int i = 0; i < nDev; i++)
    ncclCommDestroy(comm[i]);


With 2 GPUs the code runs about 2x faster; moreover, the memory consumption on each GPU is divided by 2.
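
Putting the steps together, the whole multi-GPU flow can be sketched as follows (illustrative only: buffer names, chunk sizes, and kernel launch parameters are assumptions, and error checking is omitted):

```cuda
#include <cuda_runtime.h>
#include <nccl.h>

// Sketch of the full flow: per-GPU local histograms, then an all-reduce.
// Assumes d_data[i], d_histogram[i], and chunkSize[i] were set up per device,
// and that a per-GPU histogram kernel exists.
void multiGpuHistogram(float **d_data, int **d_histogram, size_t *chunkSize,
                       int numBins, ncclComm_t *comm, int nDev, int *h_histogram)
{
    // 1. Each GPU computes its local histogram on its chunk.
    for (int i = 0; i < nDev; i++) {
        cudaSetDevice(i);
        // histogramKernel<<<blocks, threads>>>(d_data[i], chunkSize[i], d_histogram[i], ...);
    }
    // 2. Sum the local histograms across all GPUs (group semantics are
    //    required when one thread drives several devices).
    ncclGroupStart();
    for (int i = 0; i < nDev; i++)
        ncclAllReduce(d_histogram[i], d_histogram[i], numBins,
                      ncclInt, ncclSum, comm[i], NULL);
    ncclGroupEnd();
    // 3. All GPUs now hold the same global histogram; copy it back from GPU 0.
    cudaSetDevice(0);
    cudaMemcpy(h_histogram, d_histogram[0], numBins * sizeof(int),
               cudaMemcpyDeviceToHost);
}
```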