Aion Node vs Meluxina Node with QMeCha

The tests run on 30 January 2024 still showed some problematic behavior of the AION nodes with MPI/OMP calculations.

To highlight the problems, I first ran a QMeCha test on a single node, varying the ratio between MPI tasks and OMP threads.

Within this test, we expect that:

  • The timings should remain constant

  • The estimation of the energy must be identical within machine accuracy

For this purpose, we use the latest personal version of QMeCha beta (new_atm_cut_off branch).

Compilation on Meluxina uses the modules:

module load foss/2022b

while compilation on AION uses the modules:

module load foss 

The configuration options for the compilation on both machines are the same.

To run the calculations on both HPCs we used both the srun and mpirun commands, yet since the results were comparable we only report those obtained with srun, submitted using the command line:

[Screenshot from 2024-01-30 13-48-29: submission command line and timing results]
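For reference, a minimal sketch of the kind of single-node submission used for this sweep (the values and variable names are illustrative and follow the command line reported further below, not the exact script):

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --exclusive
# The MPI/OMP ratio is swept by changing these two options together
# (128x1, 64x2, 32x4, 16x8, ... on the 128-core node):
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=1

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun -N $SLURM_NNODES -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK \
     $QMECHA_EXE -i vmc.inp > out_$SLURM_CPUS_PER_TASK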

Although the timings are compatible, a problem with pure MPI calculations seems to appear.

Notice that on Meluxina hyper-threading is enabled, yet I did not use it, restricting the runs to the 128 physical cores.

To check the issue that appears for pure MPI calculations, I logged into the node (aion-0112) and checked the balance of the cores' usage with htop:

The problem is that some of the cores do not seem to be working, so their workload is assigned to cores that are already running other independent Monte Carlo dynamics.

When the ratio between MPI tasks and OMP threads is decreased, the workload becomes balanced again:

yet the cores are not working at 100% of their capacity.

This behavior might affect all the pure MPI calculations run on the AION nodes, and might also be responsible for the slowdown experienced by some users with the FHI-aims code.

How do you disable hyper-threading on Meluxina? I suppose that you use some combination of OMP_PLACES and OMP_PROC_BIND. These settings, however, affect only your program; it is possible that the operating system is using threads on your cores in the background.
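For context, the kind of pinning I have in mind would be something like the following (illustrative values, not necessarily what was used):

export OMP_NUM_THREADS=2      # threads per MPI task
export OMP_PLACES=cores       # one OpenMP place per physical core
export OMP_PROC_BIND=close    # keep a task's threads on neighbouring cores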

Normally MPI processes busy-wait, so it is improbable that the inactive threads on Aion are simply in a waiting loop. Something else must be delaying your threads. Leaving no hardware thread for the operating system may be an explanation: for instance, if some I/O is performed, it may explain the inactive threads. Some ways to check this theory:

  • Are the inactive threads always the same? If the problem is something like I/O that all the processes need to perform, I would expect the inactive threads to change.
  • What happens when you leave some cores for the operating system? E.g. get a full node on Aion, but launch only 120 (out of 128) processes (see the sketch after this list).
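A minimal sketch of this second check, reusing the variable names from the submission command reported further below (only the 120/128 split is the intended change):

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --exclusive
# Reserve the full 128-core Aion node, but start only 120 processes,
# leaving 8 cores free for the operating system.
#SBATCH --ntasks=120
#SBATCH --cpus-per-task=1

export OMP_NUM_THREADS=1
srun -n 120 -c 1 $QMECHA_EXE -i vmc.inp > out_120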

If you would like to provide access to the code we could help you design benchmarks to further diagnose the issue.

- 124 cores are working at 100% and 4 cores at 0%; is this behaviour constant?
- No! It is not constant, it can vary. Also, the cores involved are not always the same.

Yea, it seems that leaving no hardware threads for the operating system is an issue. Hardware threads really speed up operations like I/O.

Also, do you know if this behaviour is affected by the update on Aion?

Could you share your SBATCH script and your submission command line?

If you compile your application without OpenMP support, do you still have the same problematic behavior?

I did not disable Meluxina's hyper-threading; I just requested 128 of the 256 hardware threads and specified the distribution of the MPI tasks over the sockets.

The code does not use a lot of I/O. Only the output file is updated, by rank 0, but this would at most lead to desynchronisation between the tasks, not to the load of some cores dropping to 0%.

The code continues to run, but on a smaller number of cores.

Yes, sorry, there is a single thread per process in your run. As @xbesseron said, removing OpenMP could rule out one more probable cause.

In any case, if you see that a variable number of cores hangs randomly and they are not always the same cores, try using fewer cores to see if you get better performance.

I have problems submitting the script.
You can find it in:
/home/users/mbarborini/test/aion/vmc_gf_oblas_mpi08_omp_srun/aion_vmc.sh

I have problems disabling OMP because OpenBLAS requires the threading; I can try compiling with Intel.


(Yes, in the past I had issues replying on the hpc-discourse.)

A few more comments:

  • I don’t have permissions to open the file you mentioned.

  • By OBLAS, do you mean OpenBLAS? It should be possible to use it without OpenMP. You compile your source code without the -fopenmp flag but you still link with -pthread (see the sketch after this list).

  • Do you know if your application uses MPI_Init() or MPI_Init_thread()? If the latter, with which level of thread support? (cf. the MPI_Init_thread(3) man page, version 4.0.7)

  • Also, to make sure, do you correctly set OMP_NUM_THREADS=1 when you run with one MPI process per core?
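Regarding the OpenBLAS point above, a rough sketch of what a build without OpenMP could look like, assuming a foss/GCC toolchain (file and library names are illustrative, not the actual QMeCha build system):

# Compile without -fopenmp so that the !$omp directives are treated as comments
# (-cpp is still needed for the #ifdef preprocessing in the sources)
mpifort -O2 -cpp -c openmp_mpi_m.F90 other_sources.F90
# ...but link against the pthread build of OpenBLAS
mpifort -O2 -o qmecha.x *.o -lopenblas -pthread

# and keep each rank serial at run time
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1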

The submission command is:

export MKL_NUM_THREADS=1

time srun -N $SLURM_NNODES -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK --ntasks-per-socket=$SLURM_NTASKS_PER_SOCKET --ntasks-per-node=$SLURM_NTASKS_PER_NODE $QMECHA_EXE -i vmc.inp $QMECHA_FLAGS > out_$SLURM_CPUS_PER_TASK

@matteo.barborini cannot post the code due to the Web Application Firewall issue. We are helping SIU update the filters as problems arise.

export MKL_NUM_THREADS=1

If compiled with foss, I think OMP_NUM_THREADS should be used.

Besides that, I don't see any obvious mistakes.
I could suggest my favorite srun flags in addition to yours:
--mem=0 --exact --exclusive --hint=nomultithread --cpu-bind=verbose,cores
They are worth trying.
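Combined with the submission line above, that would look roughly like this (a sketch, untested):

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=1

time srun -N $SLURM_NNODES -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK \
     --ntasks-per-socket=$SLURM_NTASKS_PER_SOCKET --ntasks-per-node=$SLURM_NTASKS_PER_NODE \
     --mem=0 --exact --exclusive --hint=nomultithread --cpu-bind=verbose,cores \
     $QMECHA_EXE -i vmc.inp $QMECHA_FLAGS > out_$SLURM_CPUS_PER_TASK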


The MPI/OMP environment is initialized by the module:

!===============================================================================
!> MODULE openmp_mpi_m (QMeCha BETA VERSION 2023)
!>  
!> @Author Matteo Barborini (matteo.barborini@gmail.com)
!>  
!> @Private repository https://github.com/mbarborini/QMeCha_beta
!>
!> Module used to construct the omp/mpi environment that defines the parallelization
module openmp_mpi_m
use fortran_kinds_v, only: int32, stdout
use write_lines_m,   only: write_separator_line, write_simple_line, write_empty_line, &
                         & write_variable_line
#ifdef _MPI
use mpi
#endif
#ifdef _MPI08
use mpi_f08
#endif
implicit none
#ifdef _MPIh
include 'mpif.h'
#endif
!> @int32 mpi_rank    Rank of a specific mpi task
!> @int32 n_mpi_tasks Total number of mpi tasks
!> @int32 n_omp_tasks Number of OMP threads per MPI task
integer(int32), public, save :: mpi_rank
integer(int32), public, save :: n_mpi_tasks
integer(int32), public, save :: n_omp_tasks
!> @int   mpierr      Error code returned by the MPI calls
integer       , public :: mpierr

#ifdef _OMP
integer(int32), external :: omp_get_num_threads, omp_get_thread_num
#endif
public :: init_ompmpi_env, fnlz_ompmpi_env

contains
!-------------------------------------------------------------------------------
!> Initialization of OMP / MPI environment
subroutine init_ompmpi_env()
#if defined _MPI || defined _MPIh || defined _MPI08
    call mpi_init(mpierr)
    call mpi_comm_size(MPI_COMM_WORLD,n_mpi_tasks,mpierr)
    call mpi_comm_rank(MPI_COMM_WORLD,mpi_rank,mpierr)
    call mpi_barrier  (MPI_COMM_WORLD,mpierr)
#else
    n_mpi_tasks = 1_int32
    mpi_rank    = 0_int32
#endif
#ifdef _OMP
!$omp parallel
    n_omp_tasks = omp_get_num_threads()
!$omp end parallel
#else 
    n_omp_tasks = 1_int32 
#endif
    call write_separator_line(stdout,0,mpi_rank,2,"_")
    call write_simple_line(stdout,0,mpi_rank,2,"c","INITIALIZING MPI/OMP ENVIRONMENT")
    call write_empty_line(stdout,0,mpi_rank)
    call write_variable_line(stdout,0,mpi_rank,2,"Number of MPI tasks", n_mpi_tasks,var_name="n_mpi_tasks")
    call write_variable_line(stdout,0,mpi_rank,2,"Number of OPENMP threads per MPI task", n_omp_tasks,var_name="n_omp_tasks")
end subroutine init_ompmpi_env
!-------------------------------------------------------------------------------
!> Finalize MPI/OMP environment
subroutine fnlz_ompmpi_env()
#if defined _MPI || defined _MPIh || defined _MPI08
    call write_separator_line(stdout,0,mpi_rank,2,"_")
    call write_simple_line(stdout,0,mpi_rank,2,"c","FINALIZING MPI TASKS")
    call mpi_finalize(mpierr)
    call write_separator_line(stdout,0,mpi_rank,2,"=")
#endif
end subroutine fnlz_ompmpi_env
!-------------------------------------------------------------------------------      
end module openmp_mpi_m
!===============================================================================

I'm not sure why it should be a software issue, since on the node there are still 128 tasks running. The problem is rather how these are redistributed within the node.

Without export OMP_NUM_THREADS=1 (or export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK), each process of your program will run with 128 OpenMP threads.

128 threads all running on a single core is pretty bad, as they will all compete for the same resource and slow each other down. Because of that, some processes could get very slow and delay some MPI communications, which will cause other processes to wait.

To make sure that your processes are correctly distributed, add --cpu-bind=verbose to your srun command line and check the output.
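A quick way to verify the actual thread count directly on the compute node (qmecha.x stands here for the real binary name):

# NLWP = number of threads of each process; expect 1 for a pure MPI run,
# or $SLURM_CPUS_PER_TASK threads per rank for a hybrid MPI/OpenMP run
ps -o pid,nlwp,comm -C qmecha.x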

I reran the test, submitting 128 serial calculations running in parallel on the node aion-0099. The script used is the following:
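(The script itself was blocked by the forum filter mentioned below; purely as an illustration, such a submission could look like the following sketch, where $QMECHA_SERIAL_EXE is a placeholder for a serial build of the code.)

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --cpus-per-task=1
#SBATCH --exclusive

export OMP_NUM_THREADS=1

# 128 independent serial calculations, one per Slurm task
srun -n 128 -c 1 --cpu-bind=verbose $QMECHA_SERIAL_EXE -i vmc.inp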

Clearly, the number of threads is fixed to 1.

As you can see, even with 128 serial jobs some cores of the node are not working.

So the problem is not the parallelization inside the code.

Sorry, I can see that you tried to add your job script. I have reported the issue to the SIU and we are waiting for the fix.

The processes are bound by the srun/mpirun commands, not by OpenMP. So in the single-thread examples, you need to specify where you want your processes placed. If even 2 processes are placed on the same core, or processes are reassigned to cores during execution and there are time intervals when processes share cores, that can explain the large delay in the 128-process case. Again, our cores are single-threaded; the hyper-threaded cores on Meluxina may handle this workload variation more efficiently.

Unfortunately I am not very familiar with how the srun wrapper handles the core binding arguments, but in Intel MPI and OpenMPI there are flags to control process placement, like the --bind-to flag.

I think that PMI handles these options, and srun uses PMI like the Intel MPI and OpenMPI launchers, so srun should have similar options.

Setting the OpenMP number of threads

I can see in the submission script that you set OMP_NUM_THREADS and with MKL_NUM_THREADS=1 you make all calls to MKL serial.

MKL_NUM_THREADS sets the number of OpenMP threads for the MKL library only. Other libraries and your code (if you use omp pragmas) continue to use the default number of threads determined by the OpenMP library you are using (yes, even iomp does not use MKL_NUM_THREADS).
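In practice that means setting the generic and the library-specific variables explicitly, e.g. (a sketch; the OpenBLAS line only matters if the code is linked against OpenBLAS):

export OMP_NUM_THREADS=1          # generic OpenMP control, honoured by GCC's libgomp
export OPENBLAS_NUM_THREADS=1     # OpenBLAS-specific thread cap
export MKL_NUM_THREADS=1          # only has an effect if MKL is actually linked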

Can you add --cpu-bind=verbose to your srun command?
This will show how the processes are distributed and attached to the cores.

So the problem is not the parallelization inside the code.

Unfortunately, it is not that simple. A load balancing issue would show exactly the same symptoms.

This is for the 128 serial runs on one node:

cpu-bind=MASK - aion-0099, task  0  0 [718711]: mask 0xffff set
cpu-bind=MASK - aion-0099, task  1  1 [718712]: mask 0xffff set
cpu-bind=MASK - aion-0099, task  2  2 [718713]: mask 0xffff set
cpu-bind=MASK - aion-0099, task  3  3 [718714]: mask 0xffff set
cpu-bind=MASK - aion-0099, task  4  4 [718715]: mask 0xffff set
cpu-bind=MASK - aion-0099, task  5  5 [718716]: mask 0xffff set
cpu-bind=MASK - aion-0099, task  6  6 [718717]: mask 0xffff set
cpu-bind=MASK - aion-0099, task  7  7 [718718]: mask 0xffff set
cpu-bind=MASK - aion-0099, task  8  8 [718719]: mask 0xffff set
cpu-bind=MASK - aion-0099, task  9  9 [718720]: mask 0xffff set
cpu-bind=MASK - aion-0099, task 10 10 [718721]: mask 0xffff set
cpu-bind=MASK - aion-0099, task 11 11 [718722]: mask 0xffff set
cpu-bind=MASK - aion-0099, task 12 12 [718723]: mask 0xffff set
cpu-bind=MASK - aion-0099, task 13 13 [718724]: mask 0xffff set
cpu-bind=MASK - aion-0099, task 14 14 [718725]: mask 0xffff set
cpu-bind=MASK - aion-0099, task 15 15 [718726]: mask 0xffff set
cpu-bind=MASK - aion-0099, task 16 16 [718727]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 17 17 [718728]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 18 18 [718729]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 19 19 [718730]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 20 20 [718731]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 21 21 [718732]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 22 22 [718733]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 23 23 [718734]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 24 24 [718735]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 25 25 [718736]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 26 26 [718737]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 27 27 [718738]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 28 28 [718739]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 29 29 [718740]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 30 30 [718741]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 31 31 [718742]: mask 0xffff0000 set
cpu-bind=MASK - aion-0099, task 32 32 [718743]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 33 33 [718744]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 34 34 [718745]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 35 35 [718746]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 36 36 [718747]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 37 37 [718748]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 38 38 [718749]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 39 39 [718750]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 40 40 [718751]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 41 41 [718752]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 42 42 [718753]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 43 43 [718754]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 44 44 [718755]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 45 45 [718756]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 46 46 [718757]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 47 47 [718758]: mask 0xffff00000000 set
cpu-bind=MASK - aion-0099, task 48 48 [718759]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 49 49 [718760]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 50 50 [718761]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 51 51 [718762]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 52 52 [718763]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 53 53 [718764]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 54 54 [718765]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 55 55 [718766]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 56 56 [718767]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 57 57 [718768]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 58 58 [718769]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 59 59 [718770]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 60 60 [718771]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 61 61 [718772]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 62 62 [718773]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 63 63 [718774]: mask 0xffff000000000000 set
cpu-bind=MASK - aion-0099, task 64 64 [718775]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 65 65 [718776]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 66 66 [718777]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 67 67 [718778]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 68 68 [718779]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 69 69 [718780]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 70 70 [718781]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 71 71 [718782]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 72 72 [718783]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 73 73 [718784]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 74 74 [718785]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 75 75 [718786]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 76 76 [718787]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 77 77 [718788]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 78 78 [718789]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 79 79 [718790]: mask 0xffff0000000000000000 set
cpu-bind=MASK - aion-0099, task 80 80 [718791]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 81 81 [718792]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 82 82 [718793]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 83 83 [718794]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 84 84 [718795]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 85 85 [718796]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 86 86 [718797]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 87 87 [718798]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 88 88 [718799]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 89 89 [718800]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 90 90 [718801]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 91 91 [718802]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 92 92 [718803]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 93 93 [718804]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 94 94 [718805]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 95 95 [718806]: mask 0xffff00000000000000000000 set
cpu-bind=MASK - aion-0099, task 96 96 [718807]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 97 97 [718808]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 98 98 [718809]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 99 99 [718810]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 100 100 [718811]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 101 101 [718812]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 102 102 [718813]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 103 103 [718814]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 104 104 [718815]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 105 105 [718816]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 106 106 [718817]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 107 107 [718818]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 108 108 [718819]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 109 109 [718820]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 110 110 [718821]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 111 111 [718822]: mask 0xffff000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 112 112 [718823]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 113 113 [718824]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 114 114 [718825]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 115 115 [718826]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 116 116 [718827]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 117 117 [718828]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 118 118 [718829]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 119 119 [718830]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 120 120 [718831]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 121 121 [718832]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 122 122 [718833]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 123 123 [718834]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 124 124 [718835]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 125 125 [718836]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 126 126 [718837]: mask 0xffff0000000000000000000000000000 set
cpu-bind=MASK - aion-0099, task 127 127 [718838]: mask 0xffff0000000000000000000000000000 set

I will create a test Fortran code with my MPI environment, test it, and put everything in a public GitHub repository.

OK. It seems that each process is attached to a group of 16 cores, which is shared by 16 processes. It means that a process is free to run on any of these 16 cores at a given time, and to move between cores. Potentially, 2 processes can run on the same core, which would explain the behavior you're seeing with htop.

Now, can you add --cpu-bind=verbose,cores to your srun command line?
That will attach each process to a single core and prevent it from moving between cores.

OK, this is somewhat confusing notation! The mask is little-endian, and the verbose output does not print the leading zeros.

@matteo.barborini Basically the mask is a bit array where a 1 at index n indicates that the process (task) is allowed to run on the core with index n. The mask is printed as a number in hexadecimal notation. In your case the processes are bound to the NUMA node where they start; with --cpu-bind=cores each process will be bound to a single core.
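To make the masks above concrete, they can be decoded into core indices with plain bash arithmetic (this only works for masks that fit into 64 bits; the value is taken from the log above):

mask=0xffff00000000      # mask reported for tasks 32-47
for i in $(seq 0 63); do
  (( (mask >> i) & 1 )) && printf '%d ' "$i"
done; echo
# prints 32 33 ... 47, i.e. the third group of 16 cores (one NUMA domain)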

The experiment of targeting the node aion-0099 and running consecutive pure MPI calculations with and without core binding was successful.

The same calculation took, respectively:

  • core binding: 26.5 sec.
  • default binding (the 16-core NUMA groups seen in the masks above): 112.3 sec.

So the problem seems to be how Slurm handles the distribution of the tasks within the 16-core binding group (NUMA domain) during the run.

Why do two MPI tasks end up on the same core, leaving other cores unused?