Ensure you have requested a GPU node
On the ULHPC, you can quickly check whether you are on a GPU node by running the following command: nvidia-smi
If you see nvidia-smi: command not found,
you are not on a GPU node. You can request one in an interactive session via the following command: si-gpu
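This check can also be scripted, for instance at the top of a job script. A minimal sketch, relying only on the fact stated above that nvidia-smi is on the PATH of GPU nodes (the gpu_check function name is just for illustration):

```shell
#!/bin/sh
# Minimal sketch: report whether the current node exposes nvidia-smi,
# which on the ULHPC indicates a GPU node.
gpu_check() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "GPU node: nvidia-smi found"
    else
        echo "Not a GPU node: request one with si-gpu"
    fi
}
gpu_check
```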
If you are on a GPU node, you should see something like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:1D:00.0 Off | 0 |
| N/A 33C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
For more information about GPU node jobs, you can consult the documentation directly.
Ensure you are using the GPU
In this knowledge nugget, I will use a TensorFlow container in three different cases. The script is as follows:
import tensorflow as tf

# Debugging to check where instructions are placed, CPU or GPU
tf.debugging.set_log_device_placement(True)

with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)
Correct: with --nv on a GPU node
We use --nv
to load the host NVIDIA drivers and allow the container to use the GPU. See an example command below:
singularity run --nv tensorflow2.sif python script.py
We can check from the log that everything is fine: a GPU (Tesla V100) has been found, and the MatMul
operation has been executed on the GPU.
Found device 0 with properties:
pciBusID: 0000:1d:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
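In batch jobs, this placement line can be checked automatically instead of by eye. A minimal sketch, assuming the log format shown above (the sample line is hard-coded here for illustration; in practice you would grep the job's output):

```shell
#!/bin/sh
# Sketch: decide from a TensorFlow device-placement log line whether
# the op ran on the GPU. The sample line is copied from the output above.
log="Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0"
case "$log" in
  *"/device:GPU:"*) echo "MatMul ran on the GPU" ;;
  *)                echo "MatMul ran on the CPU" ;;
esac
```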
Incorrect: without --nv on a GPU node
If the --nv
option is forgotten, the script will still succeed. This is because TensorFlow falls back to the CPU for many operations when no GPU is found.
singularity run tensorflow2.sif python script.py
We can see that something is wrong in the logs: the NVIDIA driver cannot be found, and the MatMul
operation is executed on the CPU.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Executing op MatMul in device /job:localhost/replica:0/task:0/device:CPU:0
Incorrect: with --nv on a non-GPU node
In this case, the command itself is correct, see below:
singularity run --nv tensorflow2.sif python script.py
However, the node is not a GPU node, so no NVIDIA files can be found on the host. As you can see here, the script yet again succeeds: Singularity treats the unnecessary --nv
option as non-blocking, and TensorFlow executes the MatMul
operation on the CPU.
INFO: Could not find any nv files on this host!
Executing op MatMul in device /job:localhost/replica:0/task:0/device:CPU:0
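Both mistakes can be avoided by adding --nv only when the host actually has the NVIDIA driver. A minimal sketch, reusing the nvidia-smi check and the image/script names from the examples above (the build_cmd helper is just for illustration, and it only prints the command rather than running it):

```shell
#!/bin/sh
# Sketch: emit the singularity command, adding --nv only when
# nvidia-smi is available on the host (i.e. we are on a GPU node).
build_cmd() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "singularity run --nv tensorflow2.sif python script.py"
    else
        echo "singularity run tensorflow2.sif python script.py"
    fi
}
build_cmd
```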