Ensure you have requested a GPU node
On the ULHPC, you can quickly check whether you are on a GPU node by running the following command: nvidia-smi
If you see nvidia-smi: command not found,
you are not on a GPU node. You can request one in an interactive session via the following command: si-gpu
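This check can also be scripted, for instance at the top of a job script. A minimal sketch, relying only on the fact stated above that nvidia-smi is on the PATH of GPU nodes (the gpu_check function name is just for illustration):

```shell
#!/bin/sh
# Minimal sketch: report whether the current node exposes nvidia-smi,
# which on the ULHPC indicates a GPU node.
gpu_check() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "GPU node: nvidia-smi found"
    else
        echo "Not a GPU node: request one with si-gpu"
    fi
}
gpu_check
```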
If you are on a GPU node, you should see something like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:1D:00.0 Off | 0 |
| N/A 33C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
For more information about GPU node jobs, you can consult the documentation directly.
Ensure you are using the GPU
In this knowledge nugget, I will use a TensorFlow container in three different cases. The script is as follows:
import tensorflow as tf

# Debugging to check where instructions are placed, CPU or GPU
tf.debugging.set_log_device_placement(True)

with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)
Correct: with --nv on a GPU node
We use --nv
to load the host NVIDIA drivers and allow the container to use the GPU. See an example command below:
singularity run --nv tensorflow2.sif python script.py
We can check from the log that everything is fine: a GPU (Tesla V100) has been found, and the MatMul
operation has been executed on the GPU.
Found device 0 with properties:
pciBusID: 0000:1d:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
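In batch jobs, this placement line can be checked automatically instead of by eye. A minimal sketch, assuming the log format shown above (the sample line is hard-coded here for illustration; in practice you would grep the job's output):

```shell
#!/bin/sh
# Sketch: decide from a TensorFlow device-placement log line whether
# the op ran on the GPU. The sample line is copied from the output above.
log="Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0"
case "$log" in
  *"/device:GPU:"*) echo "MatMul ran on the GPU" ;;
  *)                echo "MatMul ran on the CPU" ;;
esac
```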
Incorrect: without --nv on a GPU node
If the --nv
option is forgotten, the script will still succeed. This is because TensorFlow falls back to the CPU for many operations when no GPU is found.
singularity run tensorflow2.sif python script.py
We can see that something is wrong in the logs: the NVIDIA driver cannot be found, and the MatMul
operation is executed on the CPU.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Executing op MatMul in device /job:localhost/replica:0/task:0/device:CPU:0
Incorrect: with --nv on a non-GPU node
In this case, the command itself is correct, see below:
singularity run --nv tensorflow2.sif python script.py
However, the node is not a GPU node, so no NVIDIA files can be found on the host. As you can see here, the script yet again succeeds: Singularity treats the unnecessary --nv
option as non-blocking, and TensorFlow executes the MatMul
operation on the CPU.
INFO: Could not find any nv files on this host!
Executing op MatMul in device /job:localhost/replica:0/task:0/device:CPU:0
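Both mistakes can be avoided by adding --nv only when the host actually has the NVIDIA driver. A minimal sketch, reusing the nvidia-smi check and the image/script names from the examples above (the build_cmd helper is just for illustration, and it only prints the command rather than running it):

```shell
#!/bin/sh
# Sketch: emit the singularity command, adding --nv only when
# nvidia-smi is available on the host (i.e. we are on a GPU node).
build_cmd() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "singularity run --nv tensorflow2.sif python script.py"
    else
        echo "singularity run tensorflow2.sif python script.py"
    fi
}
build_cmd
```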