4.4 Shared memory and hybrid parallelization

A short introduction to MPI/OpenMP hybrid programming on Sisu.

Each compute node on Sisu contains two 12-core processors (24 cores altogether). Hence it is possible to run hybrid parallel (MPI/OpenMP) programs efficiently.


4.4.1 How to compile

All programming environments (Cray, GNU and Intel) support OpenMP. Use the following compiler flags to enable OpenMP support.

Table 4.5 Compiler flags to enable OpenMP support

Compiler    Flag
Cray        no flag needed (OpenMP support is on by default)
GNU         -fopenmp
Intel       -openmp


Here are static compilation examples for OpenMP and mixed (i.e. hybrid) MPI/OpenMP programs (first line: Cray compiler, second line: GNU compiler, third line: Intel compiler).

ftn -o my_hybrid_exe my_hybrid.f95
ftn -fopenmp -o my_hybrid_exe my_hybrid.f95
ftn -openmp -o my_openmp_exe my_openmp.f95
See the OpenMP web site (www.openmp.org) for more information, including standards and tutorials.
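As an illustration of what such a source file could contain, below is a minimal sketch of an OpenMP-only Fortran program. The file name my_openmp.f95 simply matches the compile examples above; the program itself is an assumed example, not part of the Sisu documentation.

! my_openmp.f95 -- hypothetical OpenMP-only example matching the compile lines above
program openmp_hello
  use omp_lib
  implicit none
  integer :: tid

  ! each thread reports its number and the size of the thread team
  !$omp parallel private(tid)
  tid = omp_get_thread_num()
  print '(a,i3,a,i3)', 'Thread ', tid, ' of ', omp_get_num_threads()
  !$omp end parallel
end program openmp_hello

With the GNU compiler, for example, it would be built as ftn -fopenmp -o my_openmp_exe my_openmp.f95.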


4.4.2 Include files

For Fortran 77, use the following line:

include 'omp_lib.h'
For Fortran 90 (and later), use:
use omp_lib
For C/C++, use:
#include <omp.h>
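To show these modules in use, here is a minimal hybrid MPI/OpenMP sketch in Fortran 90. This is an assumed illustration only; the file name my_hybrid.f95 merely follows the compile examples of section 4.4.1. Each MPI task reports its rank and each OpenMP thread its thread number.

! my_hybrid.f95 -- hypothetical hybrid MPI/OpenMP example
program hybrid_hello
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, rank, nranks, provided

  ! MPI_THREAD_FUNNELED: only the master thread makes MPI calls
  call mpi_init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, nranks, ierr)

  ! each OpenMP thread of each MPI task prints one line
  !$omp parallel
  print '(a,i4,a,i4,a,i3,a,i3)', 'MPI task ', rank, ' of ', nranks, &
        ', thread ', omp_get_thread_num(), ' of ', omp_get_num_threads()
  !$omp end parallel

  call mpi_finalize(ierr)
end program hybrid_hello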


4.4.3 Running hybrid MPI/OpenMP programs

In many cases it is beneficial to combine MPI and OpenMP parallelization: the inter-node communication is handled with MPI, while OpenMP is used for parallelization within a node. For example, consider an eight-node job with one MPI task per node and 24 OpenMP threads per MPI task, giving a total core (and thread) count of 192. The batch script below sets up such an 8 x 24 job. When submitting dynamically compiled executables, remember to load the same programming environment (PrgEnv-cray, PrgEnv-gnu or PrgEnv-intel) that was loaded when the code was compiled.

#!/bin/bash --login
## hybrid MPI/OpenMP example
## valid for PrgEnv-cray and PrgEnv-gnu

## The number of compute nodes for an 8 node mpi/openmp job (8*24=192)
## Job layout: one mpi process per compute node and 24 openmp threads per compute node
#SBATCH --nodes=8

## Choose a suitable queue <test,small,large>
## How to check queue limits: scontrol show part <queue name>
## for example: scontrol show part small
#SBATCH -p small

#SBATCH -J jobname
#SBATCH -o jobname_%J.out
#SBATCH -e jobname_%J.err
#SBATCH -t 01:01:00

## number of OpenMP threads
export OMP_NUM_THREADS=24

## option: -n ( total number of mpi tasks )
## option: -d ( number of OpenMP threads per mpi task )
## option: -j ( number of logical CPUs per physical CPU core )
## option: -N ( number of mpi tasks per compute node )
## run the application on compute nodes
aprun -n 8 -d 24 -N 1 -j 1 ./hybrid_executable
The same example for an Intel-compiled hybrid program:
#!/bin/bash --login
## hybrid MPI/OpenMP example
## valid for PrgEnv-intel

## The number of compute nodes for an 8 node mpi/openmp job (8*24=192)
## Job layout: one mpi process per compute node and 24 openmp threads per compute node
#SBATCH --nodes=8

## Choose a suitable queue <test,small,large>
## How to check queue limits: scontrol show part <queue name>
## for example: scontrol show part small
#SBATCH -p small

#SBATCH -J jobname
#SBATCH -o jobname_%J.out
#SBATCH -e jobname_%J.err
#SBATCH -t 01:01:00

## number of OpenMP threads
export OMP_NUM_THREADS=24

## define intel thread affinity
export KMP_AFFINITY="granularity=fine,compact,1"

## option: -n ( total number of mpi tasks )
## option: -d ( number of OpenMP threads per mpi task )
## option: -j ( number of logical CPUs per physical CPU core )
## option: -N ( number of mpi tasks per compute node )
## option: -cc none ( aprun does not bind the tasks; required in the Intel case, where KMP_AFFINITY handles the placement )
## run the application on compute nodes
aprun -n 8 -d 24 -N 1 -j 1 -cc none ./hybrid_intel_executable
If the OpenMP sections of your code do not give run-to-run reproducible results, try (with Intel-compiled code) setting the variable KMP_DETERMINISTIC_REDUCTION=yes.
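The run-to-run variation typically comes from floating-point reductions, where the order of the partial sums depends on thread scheduling. A minimal sketch of such a reduction (an assumed example, not taken from any Sisu code) is shown below; the last digits of the printed sum may differ between runs unless a deterministic reduction is enforced.

! reduction_example.f95 -- hypothetical OpenMP reduction example
program reduction_example
  implicit none
  integer :: i
  real(kind=8) :: s

  s = 0.0d0
  ! the partial sums are combined in an order that depends on the threads,
  ! so the floating-point result is not bitwise reproducible by default
  !$omp parallel do reduction(+:s)
  do i = 1, 10000000
     s = s + 1.0d0 / real(i, kind=8)
  end do
  !$omp end parallel do

  print *, 'harmonic sum = ', s
end program reduction_example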

Because each compute node on Sisu contains two 12-core processors, it may also be useful to try a hybrid MPI/OpenMP job with two MPI processes per compute node. In the next batch job example one MPI process is allocated per socket (each compute node has two sockets, and each socket holds one 12-core processor). Once again the total core (and thread) count is 192 (16 MPI processes, each with 12 OpenMP threads). Each socket in Sisu is a NUMA node that contains a 12-core processor and its local NUMA node memory; every core in a compute node can also access remote NUMA node memory, but remote references are slower than local ones.
#!/bin/bash --login
## hybrid MPI/OpenMP example
## valid for PrgEnv-cray, PrgEnv-gnu and PrgEnv-intel

## The number of compute nodes for an 8 node mpi/openmp job (16*12=192)
## Job layout: two mpi processes per compute node and 12 openmp threads per mpi process
## And more precisely: one MPI process per socket
#SBATCH --nodes=8

## Choose a suitable queue <test,small,large>
## How to check queue limits: scontrol show part <queue name>
## for example: scontrol show part small
#SBATCH -p small

#SBATCH -J jobname
#SBATCH -o jobname_%J.out
#SBATCH -e jobname_%J.err
#SBATCH -t 01:01:00
## number of OpenMP threads
export OMP_NUM_THREADS=12

## with PrgEnv-intel the next line is required (remove the comment characters, ##)
## export KMP_AFFINITY="compact,1"
## ( do NOT use the line above with PrgEnv-cray or PrgEnv-gnu )

## option: -n ( total number of mpi tasks )
## option: -d ( number of OpenMP threads per mpi task )
## option: -S ( number of mpi tasks per NUMA node)
## option: -ss ( allocate memory only from local NUMA node)
## option: -j ( number of logical CPUs per physical CPU core )
## option: -N ( number of mpi tasks per compute node )
## option: -cc numa_node ( mpi tasks and OpenMP threads are constrained to the local NUMA node )
## run the application on compute nodes
aprun -n 16 -d 12 -S 1 -ss -j 1 -N 2 -cc numa_node ./hybrid_exe
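To check the process and thread placement produced by the aprun options above, a small test code can help. The sketch below (a hypothetical placement_check.f95, not part of the Sisu documentation) prints the MPI rank, the OpenMP thread number and the CPU each thread currently runs on, calling the glibc routine sched_getcpu through ISO_C_BINDING. Compile it like the hybrid examples in section 4.4.1 and run it with the same aprun line as above.

! placement_check.f95 -- hypothetical placement test
program placement_check
  use mpi
  use omp_lib
  use iso_c_binding, only: c_int
  implicit none

  interface
    ! glibc function returning the CPU the calling thread runs on
    function sched_getcpu() result(cpu) bind(c, name='sched_getcpu')
      import :: c_int
      integer(c_int) :: cpu
    end function sched_getcpu
  end interface

  integer :: ierr, rank, provided

  call mpi_init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! one line per thread: which task, which thread, which CPU
  !$omp parallel
  print '(a,i4,a,i3,a,i3)', 'task ', rank, ' thread ', &
        omp_get_thread_num(), ' on cpu ', sched_getcpu()
  !$omp end parallel

  call mpi_finalize(ierr)
end program placement_check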

