4.4 Shared memory and hybrid parallelization

The Haswell compute nodes on Taito contain two twelve-core processors (24 cores per node), while the older Sandy Bridge compute nodes contain two eight-core processors (16 cores per node). Hence, shared memory parallel (OpenMP) programs can be run efficiently within a node with at most 24 and 16 threads, respectively.

4.4.1 How to compile

Both the Intel and GNU compilers support OpenMP. Use the following compiler flags to enable OpenMP support.

Table 4.6 OpenMP compiler flags

Compiler   Flag
Intel      -openmp
GNU        -fopenmp

Here are examples of OpenMP and mixed (i.e. hybrid) OpenMP/MPI compilation (first line: Intel compiler, second line: GNU compiler):

f95 -openmp -o my_openmp_exe my_openmp.f95
mpif90 -fopenmp -o my_hybrid_exe my_hybrid.f95

See the OpenMP web site for more information, including standards and tutorials.

Include files

For Fortran 77, use the following line:

include 'omp_lib.h'

For Fortran 90 (and later) use:

use omp_lib

For C/C++ use:

#include <omp.h>

4.4.2 Running OpenMP programs

The number of OpenMP threads is specified with the environment variable OMP_NUM_THREADS. Running a shared memory program typically requires requesting a whole node. Thus, a sixteen-thread OpenMP job can be run interactively on the Sandy Bridge nodes as shown in the following examples. To run the same job on Haswell, remember to adjust the core count accordingly (24 cores per node). If you find that the OpenMP sections of your code do not give run-to-run numerical stability, try setting the variable KMP_DETERMINISTIC_REDUCTION=yes (Intel-compiled code only).

Sample session for Intel compiled OpenMP program (Sandy Bridge):

export KMP_AFFINITY=compact
export KMP_DETERMINISTIC_REDUCTION=yes   #(if necessary and Intel compiler version is 13 or later)
export OMP_NUM_THREADS=16
salloc -N1 --cpus-per-task=16 --mem-per-cpu=1000 -t 01:00:00
srun ./my_openmp_exe
exit

Sample session for GNU compiled OpenMP program (Sandy Bridge):

export GOMP_CPU_AFFINITY=0-15
export OMP_NUM_THREADS=16
salloc -n 1 -N 1 --cpus-per-task=16 --mem-per-cpu=1000 -t 01:00:00
srun ./my_openmp_exe
exit

The corresponding batch queue script would be (below just for Intel compiled OpenMP program, Sandy Bridge):

#!/bin/bash -l
#SBATCH -J my_openmp
#SBATCH -e my_output_err_%j
#SBATCH -o my_output_%j
#SBATCH --mem-per-cpu=1000
#SBATCH -t 01:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --cpus-per-task=16
export KMP_AFFINITY=compact
export KMP_DETERMINISTIC_REDUCTION=yes    #(if necessary and Intel compiler version is 13 or later)
export OMP_NUM_THREADS=16
srun ./my_openmp_exe

In the above example, replace "export KMP_AFFINITY=compact" with "export GOMP_CPU_AFFINITY=0-15" if the code is compiled with the GNU compiler. Note that KMP_DETERMINISTIC_REDUCTION has no effect on GNU-compiled code.
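To keep the thread count in sync with the resource request, a batch script like the one above could derive OMP_NUM_THREADS from the SLURM_CPUS_PER_TASK variable that Slurm sets in the job environment, instead of hard-coding it (a sketch; the fallback value 16 is an assumption matching the Sandy Bridge example):

```shell
# Use the CPU count granted by Slurm if available; otherwise fall back to 16.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-16}
echo "Running with $OMP_NUM_THREADS OpenMP threads"
```

This way, changing only the #SBATCH --cpus-per-task line (e.g. from 16 to 24 when moving to Haswell) updates the thread count as well.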

4.4.3 Hybrid parallelization

In many cases it is beneficial to combine MPI and OpenMP parallelization: inter-node communication is handled with MPI, while OpenMP is used within each node. For example, on Sandy Bridge, consider an eight-node job with one MPI task per node, where each MPI task has sixteen OpenMP threads, giving a total core (and thread) count of 128. A hybrid job can be run interactively as above, except that more nodes are specified and one MPI task is requested per node. Because more than one node is used, the program must be run in the parallel partition. That is, for an 8 x 16 job the following commands are used:

export KMP_AFFINITY=compact
export KMP_DETERMINISTIC_REDUCTION=yes   #(if necessary and Intel compiler version is 13 or later)
export OMP_NUM_THREADS=16
salloc -p parallel -N 8 -n 8 --cpus-per-task=16 --mem-per-cpu=1000 -t 02:00:00
srun ./my_hybrid_exe
exit
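The core count requested above can be checked with a quick sanity calculation (a sketch; the variable names are illustrative, mirroring the -N and --cpus-per-task values in the salloc command):

```shell
# One MPI task per node, sixteen OpenMP threads per task.
NODES=8
CPUS_PER_TASK=16

# Total cores = nodes x threads per node.
echo "Total cores: $((NODES * CPUS_PER_TASK))"
```

For the 8 x 16 example this gives 128, matching the total thread count stated above.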

The corresponding batch queue script would be (for Intel compiled code, Sandy Bridge):

#!/bin/bash -l
#SBATCH -J my_hybrid
#SBATCH -e my_output_err_%j
#SBATCH -o my_output_%j
#SBATCH --mem-per-cpu=1000
#SBATCH -t 02:00:00
#SBATCH -N 8
#SBATCH -n 8
#SBATCH --cpus-per-task=16
#SBATCH -p parallel
export OMP_NUM_THREADS=16
export KMP_AFFINITY=compact
export KMP_DETERMINISTIC_REDUCTION=yes  #(if necessary and Intel compiler version is 13 or later)
srun ./my_hybrid_exe

In the above example, replace "export KMP_AFFINITY=compact" with "export GOMP_CPU_AFFINITY=0-15" if the code is compiled with the GNU compiler. Note that KMP_DETERMINISTIC_REDUCTION has no effect on GNU-compiled code.

4.4.4 Binding threads to cores

The compilers on Taito support thread/core affinity, which binds threads to cores for better performance. This is enabled with compiler-specific environment variables as follows (Sandy Bridge):

Intel:

export KMP_AFFINITY=compact

GNU:

export GOMP_CPU_AFFINITY=0-15

If these variables are not set, all threads in a node might run on the same core. For more information, see the Intel documentation on the thread affinity interface and the GNU compiler documentation on thread affinity.
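When moving between Sandy Bridge (16 cores per node) and Haswell (24 cores per node), the GNU affinity list can be built from the core count instead of being hard-coded (a sketch; the NCORES variable is illustrative):

```shell
# Cores per node: 16 on Sandy Bridge, 24 on Haswell.
NCORES=16

# Bind threads to cores 0 .. NCORES-1 (GNU-compiled code).
export GOMP_CPU_AFFINITY="0-$((NCORES - 1))"
echo "$GOMP_CPU_AFFINITY"
```

With NCORES=16 this produces the "0-15" list used in the examples above; setting NCORES=24 adapts it for Haswell.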
