Aion Node vs Meluxina Node with QMeCha

Wait, it went from 600+ seconds to 112 seconds for the same unoptimized setup?

Regarding the speed-up, this is my theory:

Without binding, the OS is free to move a process (task) within the NUMA node in order to run its own threads. When processes get moved, some of them end up sharing a core. This is much worse than on Meluxina, because there each core has 2 hardware threads, so the OS can run its own threads on a core where a process is already running and does not need to move processes as often. This is why binding is so crucial when hyper-threading is disabled.

When you bind your processes, the OS simply suspends a process, runs its own thread, and then resumes the process. The balanced distribution of processes across cores is not disturbed, so everything runs more smoothly.

With core binding the job takes 26.5 seconds, consistent with the Meluxina results presented in the first post.
Without core binding the job takes 112.3 seconds, because at some point some MPI tasks start running on cores that are already occupied, leaving other cores unused and slowing down the calculation.
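For illustration only (not my exact script), the two cases differ just in the binding option passed to srun; the task count assumes a 128-core Aion node, and the executable and input names are placeholders:

# bound: each MPI task stays pinned to its own core
srun --ntasks-per-node=128 --cpu-bind=cores ./qmecha.x input.inp

# unbound: the OS may migrate tasks, and two tasks can end up sharing a core
srun --ntasks-per-node=128 ./qmecha.x input.inp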

P.S. In the script I always define OMP_NUM_THREADS; MKL_NUM_THREADS was a leftover, since I do not use MKL. The tests are run with OpenBLAS.

Also, I suppose the objective of your benchmark is to determine the optimal setup for your program, i.e. the balance between processes and threads. I don’t know how much you are using OpenMP, but binding the OpenMP threads could also improve performance. Basically, prevent everything from moving around.

Some numerical libraries, MUMPS for instance, work best with one process per NUMA node. So you bind processes by core and map them by NUMA node (i.e. one process per NUMA node, but fixed to its cores), and then the OpenMP threads of each process are opened and bound to the cores where the process was mapped, i.e. within its NUMA node.
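In Open MPI terms, that mapping would look roughly like this (a sketch, assuming 16 cores per NUMA domain; the executable name is a placeholder):

export OMP_NUM_THREADS=16
# one rank per NUMA domain, 16 processing elements per rank, bound to cores
mpirun --map-by ppr:1:numa:pe=16 --bind-to core ./solver.x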

This is Open MPI terminology; let me run a test in SLURM with MUMPS, which I know well, and then we can discuss how you can optimize your srun commands.

I can’t tell what the best balance between processes and threads will be for your program without the design details, but typical settings are either one single-threaded process per core, or one process per NUMA node using as many threads as there are cores in the NUMA node.
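As a sketch of those two layouts with srun, assuming a 128-core Aion node with 8 NUMA domains of 16 cores each (the executable name is a placeholder):

# one single-threaded MPI task per core
export OMP_NUM_THREADS=1
srun --ntasks-per-node=128 --cpus-per-task=1 --cpu-bind=cores ./app.x

# one MPI task per NUMA domain, 16 OpenMP threads per task
export OMP_NUM_THREADS=16
srun --ntasks-per-node=8 --cpus-per-task=16 --cpu-bind=cores ./app.x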

The OpenMP threads are bound with a combination of the OMP_PLACES and OMP_PROC_BIND variables. These work in combination with the binding options of SLURM, since SLURM assigns the cores that each process will use.
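For example, something along these lines in the batch script (a sketch; whether close or spread placement is better depends on your code, and the task/thread counts are just the hybrid layout from above):

export OMP_PLACES=cores        # pin each OpenMP thread to a core
export OMP_PROC_BIND=close     # keep threads close to their parent process
export OMP_NUM_THREADS=16
srun --ntasks-per-node=8 --cpus-per-task=16 --cpu-bind=cores ./app.x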

@xbesseron Any tips are welcome!

The benchmark obtained now is fine. The ratio between MPI tasks and OpenMP threads is system-dependent and depends on the memory allocation, the number of parameters, the type of simulation to be executed, etc. The scope of the test was just to detect the origin of the slowdown, because this same problem can affect other codes such as QuantumEspresso, Yambo, FHI-aims, Orca, etc.

The processors on Aion and MeluXina are the same. The difference is that Aion has hyper-threading disabled at the BIOS level, so the hyper-threads are not visible.

Using --hint=nomultithread, you can tell SLURM that you don’t want to use the hyper-threads, so it will correctly use only one hardware thread per physical core. You can use this option in your scripts on both platforms to get more consistent results. It won’t hurt to have it on Aion.
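For instance (a sketch; the task count assumes 128 physical cores per node and the executable name is a placeholder):

# ask SLURM to use only one hardware thread per physical core
srun --hint=nomultithread --ntasks-per-node=128 ./app.x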

Without the --cpu-bind=cores option, the operating system is free to move processes from one core to another. There can be valid reasons to do that, for example to execute another process, such as a system service like mmfsd, which is related to the GPFS filesystem. But in doing so, the OS might also move your application process to another core and not move it back immediately, which causes the problem you had.

It would be interesting to run your script again on MeluXina with --cpu-bind=verbose to see what the default binding is. It could be that a different version of SLURM, or a different configuration, causes a different binding by default, hence the difference in performance. Then run it again on MeluXina with --cpu-bind=verbose,cores. Since the processors are the same, you should be able to get similar performance on both platforms in the end.
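Something like this would show the binding that SLURM actually applies (the task count and executable are placeholders):

# report the default binding on stderr
srun --cpu-bind=verbose --ntasks-per-node=128 ./qmecha.x input.inp

# report the binding with explicit core binding
srun --cpu-bind=verbose,cores --ntasks-per-node=128 ./qmecha.x input.inp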

As @gkaf mentioned, there are other options that you can try to further optimize your MPI+OpenMP execution.

Thank you Xavier,
from the comparisons in the table reported at the beginning, apart from noise, I would say that the timings between Aion and MeluXina are now in perfect agreement (also for the pure MPI case).

Also, I tested the run on 10 nodes and I didn’t see any particular slowdown on Aion; the weak scaling seems to hold with pure MPI.