4.3 Using MPI

4.3.1 Message Passing Interface

MPI (Message Passing Interface) is a standard specification for message passing libraries. It makes it possible to write portable parallel programs in Fortran and C. MPI has become the de facto standard for communication among the processes that make up a parallel application running on a distributed memory system (like Sisu.csc.fi).

In message passing, each process (task) has its own address space in memory that other processes cannot access directly. In a parallel application these processes communicate with each other by passing messages. On Sisu (a Cray XC40 supercomputer) the message passing implementation is optimized to take advantage of the high-speed Aries interconnect. All programming environments (PrgEnv-cray, PrgEnv-gnu and PrgEnv-intel) can use the MPI library implemented by Cray. The current release implements the MPI-3.0 standard, but dynamic process management is not supported; for more information see the manual page mpi_intro (command: man mpi_intro). The NOTES section of the manual page lists the MPI routines that are not supported, and the ENVIRONMENT VARIABLES section describes variables that mainly control the runtime behaviour of message passing. Changing the values of some of these variables can sometimes improve performance.
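
As an illustration of how such runtime variables are typically set, the following sketch exports two variables in a batch script before the aprun line. The variable names MPICH_VERSION_DISPLAY and MPICH_ENV_DISPLAY are assumptions based on Cray MPICH releases and are not taken from this guide; check the ENVIRONMENT VARIABLES section of man mpi_intro on Sisu for the names and values that the installed release actually supports.

```shell
# Assumed variable names; confirm them with "man mpi_intro" before use.
# Ask the MPI library to print its version at start-up.
export MPICH_VERSION_DISPLAY=1
# Ask the MPI library to print its environment variable settings at start-up.
export MPICH_ENV_DISPLAY=1

# The aprun line of the batch job would follow here, for example:
# aprun -n 24 ./mpi_executable
```

Because the variables are exported in the job script, they apply only to that job, which makes it easy to compare runs with and without a given setting.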

4.3.2 Compiling and linking

All compilers are accessed through the Cray drivers (wrapper scripts) ftn, cc and CC. No matter which vendor's compiler module is loaded, always use the ftn, cc and CC commands to invoke the compiler: ftn launches a Fortran compiler, cc a C compiler and CC a C++ compiler. If you compile and link in separate steps, use the Cray driver commands in the linking step as well, rather than calling the linker ld directly. No additional MPI library linking options are required with the Cray wrappers.

For example, MPI programs written in Fortran 90, C and C++ can be built as follows:

ftn my_fortran_mpi_code.f95
cc my_c_mpi_code.c
CC my_C_mpi_code.C

Compile a Fortran file:

ftn -c f_routine.f95

Compile a C file:

cc -c c_routine.c


Link object files into a static executable:

ftn -o my_mpi_app c_routine.o f_routine.o

Link object files into a dynamically linked executable:

ftn -dynamic -o my_mpi_app c_routine.o f_routine.o

Chapter 4.1 Compiling Environment has more information about compiler flags.

4.3.3 Include files

In Fortran, every source file that contains MPI calls must include the MPI definitions. With Fortran 77, the source code should contain the line:

include 'mpif.h'

With Fortran 90 (or later) it is recommended to use the MPI module:

use mpi

In C/C++, use:

#include <mpi.h>

There are namespace clashes between stdio.h and the MPI C++ binding. To avoid this conflict, make sure your application includes the mpi.h header file before stdio.h or iostream.h.

4.3.4 Running an MPI batch job

A basic MPI batch job example.

#!/bin/bash -l
## The number of compute nodes for a 144 mpi processes job (6*24=144)
#SBATCH --nodes 6
## It is recommended to allocate just the number of nodes.
## Each compute node has 24 cores (See more details in section Hardware on Sisu User Guide).
## Give the number of mpi processes and other job launching details on the aprun line
## (see the last line of this example)

## Choose a suitable queue <test,small,large>
## How to check queue limits: scontrol show part <queue name>
## for example: scontrol show part small
#SBATCH -p test

## Name of your job
#SBATCH -J jobname

## System message output file
#SBATCH -o jobname_%J.out

## System error message file
#SBATCH -e jobname_%J.err

## How long job takes, wallclock time hh:mm:ss
#SBATCH -t 00:11:00

## Run MPI executable on compute nodes
## option -n gives the number of processes (recommendation: multiples of 24)
## Above we have allocated 6 compute nodes, so it is possible to run 6*24=144 mpi processes.
## Calculate the total number of cores and store it in variable ncores
(( ncores = SLURM_NNODES * 24 ))
aprun -n $ncores /wrk/$USER/mpi_executable 

Each compute node has 24 cores and 64 GB of memory (2.67 GB per core). Strong recommendation: submit jobs in which the number of allocated cores is divisible by 24. More information can be found in the chapter Using Batch Job Environment.
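
The node count for a given number of MPI processes follows from the 24 cores per node. The sketch below shows that arithmetic; the helper names nodes_needed and fills_whole_nodes are hypothetical and serve only as an illustration, not as part of the Sisu environment.

```shell
# Cores per Sisu compute node (from the text above).
CORES_PER_NODE=24

# Hypothetical helper: print the number of compute nodes needed for
# the given number of MPI processes, rounding up to whole nodes.
nodes_needed() {
    echo $(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Hypothetical helper: succeed only if the process count fills
# every allocated node completely (is divisible by 24).
fills_whole_nodes() {
    [ $(( $1 % CORES_PER_NODE )) -eq 0 ]
}

nodes_needed 144    # prints 6
nodes_needed 100    # prints 5; the fifth node would be only partly used
```

A process count such as 100 still runs, but it leaves cores idle on the last node, which is why multiples of 24 are recommended.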

When running memory intensive jobs (jobs that need more than 2.67 GB of memory per MPI task), the application must use fewer than 24 cores per compute node. In the next example the parallel job can have nearly 8 GB per core (per MPI task) by using 8 cores per compute node. Furthermore, the example places 4 MPI tasks on each socket (each compute node has two sockets, and each socket holds one 12-core processor and its local 32 GB of memory). A socket is also a NUMA node, so each compute node has two NUMA nodes.

#!/bin/bash -l
## memory intensive example (actually taito.csc.fi should be better for memory intensive jobs).
## 144 mpi processes will need 18 nodes if we use just 8 cores per node 144/8=18
## the number of compute nodes
#SBATCH --nodes 18
## Give the number of mpi processes and other job launching details on the aprun line
## (see the last line of this example)

## name of your job
#SBATCH -J jobname
## system message output file
#SBATCH -o jobname_%J.out
## system error message file
#SBATCH -e jobname_%J.err
## how long job takes, wallclock time hh:mm:ss
#SBATCH -t 11:01:00

## Choose a suitable queue <test,small,large>
## How to check queue limits: scontrol show part <queue name>
#SBATCH -p small

## option: -n (total number of mpi processes)
## option: -N (number of mpi processes per compute node)
## option: -S (number of mpi processes per NUMA node)
## option: -ss (allocate memory only from a local NUMA node)
## run the application on compute nodes
(( ncores = SLURM_NNODES * 8 ))
aprun -n $ncores -N 8 -S 4 -ss ./mpi_executable

Please remember that taito.csc.fi has very good facilities for memory intensive jobs. On Taito each node has at least 64 GB of memory; in addition, 26 compute nodes have 256 GB of memory and six nodes have 1.5 TB of memory. Thus, if a memory intensive application does not benefit from the Aries interconnect, taito.csc.fi is a better choice.
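
The placement options used in the memory intensive Sisu example can be derived from the node layout described above (64 GB and two NUMA nodes per compute node). The following sketch shows that derivation; the helper names mem_per_task_gb and aprun_options are hypothetical, and the memory figure is an upper bound because part of the node memory is not available to the application.

```shell
# Node layout taken from the text above.
NODE_MEM_GB=64
NUMA_PER_NODE=2

# Hypothetical helper: print the upper bound (in whole GB) of memory
# per MPI task when the given number of tasks share one compute node.
mem_per_task_gb() {
    echo $(( NODE_MEM_GB / $1 ))
}

# Hypothetical helper: print the aprun placement options for a job with
# $1 MPI tasks in total and $2 tasks per node, spread evenly over the
# NUMA nodes of each compute node.
aprun_options() {
    echo "-n $1 -N $2 -S $(( $2 / NUMA_PER_NODE )) -ss"
}

mem_per_task_gb 8      # prints 8
aprun_options 144 8    # prints: -n 144 -N 8 -S 4 -ss
```

Choosing an even number of tasks per node keeps the two NUMA nodes equally loaded, so that the -ss option can serve every task from its local 32 GB of memory.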


4.3.5 Manual pages

More information can be found in the manual pages:

man mpi_intro
man sbatch
man aprun

