3.3 Using aprun to execute parallel processes

The aprun command is a Cray Linux Environment utility that launches executables on the compute nodes. It is analogous to the SLURM command srun, which should not be used in Sisu. The following table lists the most important aprun options. For a complete description of the options, see the manual page aprun(1):

man aprun
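
As a minimal illustration (the executable name my_mpi_program is only a placeholder, not part of this guide), a pure MPI program could be launched inside a batch job allocation on 24 cores, i.e. one full Sisu compute node, with:

aprun -n 24 ./my_mpi_program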
Table 3.4 Most important aprun options.

 

aprun option Description
-n PEs The number of processing elements (PEs, in Cray terminology), often the same as the number of cores needed by an application; at Sisu: the number of MPI tasks. The default is 1.
-N PEs_per_node The number of PEs per node, at Sisu: the number of MPI tasks per compute node. Not to be confused with the -N option of sbatch, which has a completely different meaning.
-m size Specifies the memory required per PE, at Sisu: the memory required per MPI task. The value is the Resident Set Size in megabytes; the suffixes K, M and G are supported (for example, 16M = 16 megabytes). Any truncated or full spelling of unlimited is recognized.
-d depth The number of threads per PE, at Sisu: number of OpenMP threads per MPI task. The default is 1.
-j num_cpus Specifies how many CPUs to use per compute unit for an ALPS job, at Sisu: the number of logical CPU cores used per physical CPU core. At Sisu, at most 2 logical cores are available per physical core. If logical CPU cores are not wanted, the setting -j 1 is recommended.
-L node_list The node_list specifies the candidate nodes to constrain application placement. The syntax allows a comma-separated list of node IDs (node,node,...), a range of nodes (node_x-node_y), and a combination of both formats.
-e ENV_VAR=value Sets an environment variable on the compute nodes; the format VARNAME=value must be used. To set multiple environment variables, use multiple -e arguments.
-S PEs_per_NUMA_node The number of PEs per NUMA node, at Sisu: the number of MPI tasks per NUMA node. Each compute node in Sisu has two sockets, and each socket is a NUMA node containing one 12-core processor and its local NUMA node memory.
-ss Requests strict memory containment per NUMA node, i.e. memory is allocated only from the local NUMA node.
-cc CPU_binding Controls how tasks are bound to cores and NUMA nodes. -cc none means that no CPU affinity is applied. -cc numa_node constrains MPI tasks and OpenMP threads to their local NUMA node. The default is -cc cpu, which binds each MPI task to a core; this is the typical choice for a pure MPI job.
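
As an illustrative sketch of combining these options (the executable name my_hybrid_program and the chosen task and thread counts are assumptions, not requirements of Sisu), the line below would place a hybrid MPI/OpenMP run on two nodes: 8 MPI tasks in total, 4 tasks per node, 2 tasks per NUMA node, 6 OpenMP threads per task, no logical cores, strict NUMA memory containment and NUMA-node binding:

aprun -n 8 -N 4 -S 2 -d 6 -j 1 -ss -cc numa_node -e OMP_NUM_THREADS=6 ./my_hybrid_program

With this placement each node runs 4 x 6 = 24 threads, matching the 24 physical cores of a Sisu compute node.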

 
An example script, jobtest.sh, for running a parallel job:

#!/bin/bash -l
# Reserve 6 compute nodes (6 x 24 cores = 144 cores)
#SBATCH -N 6
#SBATCH -J cputest
#SBATCH -p small
#SBATCH -o /wrk/username/%J.out
#SBATCH -e /wrk/username/%J.err
# Launch 144 MPI tasks, one per core
aprun -n 144 /wrk/username/my_program
 
Submitting jobtest.sh:
sbatch jobtest.sh 
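
A hybrid MPI/OpenMP job can be submitted in the same way. The following script is only a sketch (the node count, job name and program name my_hybrid_program are assumptions): it reserves 4 nodes and starts 4 MPI tasks per node, each running 6 OpenMP threads, so that all 24 cores of every node are used:

#!/bin/bash -l
#SBATCH -N 4
#SBATCH -J hybridtest
#SBATCH -p small
#SBATCH -o /wrk/username/%J.out
#SBATCH -e /wrk/username/%J.err
# Number of OpenMP threads per MPI task
export OMP_NUM_THREADS=6
# 4 nodes x 4 MPI tasks per node = 16 tasks, each bound to its local NUMA node
aprun -n 16 -N 4 -S 2 -d 6 -cc numa_node /wrk/username/my_hybrid_program

It would be submitted with sbatch in the same way as jobtest.sh above.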
