3.3 Parallel batch jobs
Two approaches are commonly used to create software that can utilize several computing cores: Message Passing Interface (MPI) based methods and threads-based programs (POSIX threads, OpenMP). Of the CSC resources, the Sisu supercomputer is intended for large MPI-based parallel jobs, but smaller MPI jobs can also be run in the Taito supercluster. Threads-based parallel programs should mainly be run in the Taito supercluster.
In threads-based parallel computing, the number of parallel processes (threads) is limited by the structure of the hardware: all the processes must run in the same node. Thus, in the Sandy Bridge nodes of the Taito cluster, threads-based programs can't use more than 16 computing cores. In the Haswell nodes, the maximum is 24 threads.
The sbatch option --cpus-per-task=number_of_cores is used to define the number of computing cores that the batch job will use. The option --nodes=1 ensures that all the reserved cores are located in the same node, and -n 1 assigns all the reserved computing cores to one and the same task.
For threads-based jobs, the --mem option is recommended for memory reservation. This option defines the amount of memory needed per node. Note that if you use the --mem-per-cpu option instead, the total memory request of the job is the per-core memory request multiplied by the number of cores. Thus, if you change the number of cores to be used, you should check the memory reservation too.
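The difference between the two memory options can be checked with a little shell arithmetic. The following is only a sketch for a single-node job: the variable names are illustrative, and the values (6000 MB per node, 1000 MB per core) are assumptions chosen to match a 6-core job.

```shell
#!/bin/bash
# Illustrative values, not taken from any real job.
mem_per_node=6000   # MB, as it would be given to --mem
mem_per_cpu=1000    # MB, as it would be given to --mem-per-cpu

# Compare the total reservation for two different core counts.
for cores in 6 12; do
    # --mem-per-cpu scales with the number of cores, --mem does not.
    total_mem_per_cpu=$(( mem_per_cpu * cores ))
    echo "${cores} cores: --mem=${mem_per_node} reserves ${mem_per_node} MB, --mem-per-cpu=${mem_per_cpu} reserves ${total_mem_per_cpu} MB"
done
```

With these numbers, doubling the core count doubles the --mem-per-cpu reservation, while the --mem reservation stays the same.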
#!/bin/bash -l
#SBATCH -J bowtie2
#SBATCH -o output_%j.txt
#SBATCH -e errors_%j.txt
#SBATCH -t 02:00:00
#SBATCH -n 1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=6
#SBATCH -p serial
#SBATCH --mem=6000
#
module load biokit
bowtie2-build chr_18.fa chr_18
bowtie2 -p $SLURM_CPUS_PER_TASK -x chr_18 -1 y_1.fq -2 y_2.fq > output.sam
In the example above, one task (-n 1) that uses 6 cores (--cpus-per-task=6) with a total of 6 GB of memory (--mem=6000) is reserved for two hours (-t 02:00:00). All the cores are assigned from one computing node (--nodes=1). When the job starts, the CSC bioinformatics environment, which includes Bowtie2, is first set up with the command:
module load biokit
After that, two Bowtie2 commands are executed. The indexing command, bowtie2-build, does not utilize parallel computing. For the bowtie2 command, the number of cores to be used is defined with the option -p. Here we are using six cores, so the definition could be -p 6. However, we use the environment variable $SLURM_CPUS_PER_TASK instead. This variable contains the number of cores defined by the --cpus-per-task option. Thus, by using $SLURM_CPUS_PER_TASK we don't have to modify the bowtie2 command if we change the number of cores with the SBATCH options.
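Slurm sets $SLURM_CPUS_PER_TASK only when --cpus-per-task is given, so a script that is also tested outside the batch system may want a fallback value. The sketch below shows one possible way to do this. The helper thread_count is hypothetical (it is not part of Slurm or Bowtie2), and the bowtie2 command line is only printed, not executed.

```shell
#!/bin/bash
# thread_count is a hypothetical helper, not part of Slurm or Bowtie2.
# It prints SLURM_CPUS_PER_TASK, or 1 if the variable is unset or empty.
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Print the command line that would be run (nothing is executed here).
echo "bowtie2 -p $(thread_count) -x chr_18 -1 y_1.fq -2 y_2.fq"
```

Inside a batch job submitted with --cpus-per-task=6 the printed command uses -p 6. Without the variable it falls back to a single thread.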
To compile Fortran + MPI code the following command can be used:
mpif90 my_mpi_prog.f95 -o my_mpi_program
The output executable program my_mpi_program is created.
#!/bin/bash -l
###
### parallel job script example
###
## name of your job
#SBATCH -J my_jobname
## system error message output file
#SBATCH -e my_output_err_%j
## system message output file
#SBATCH -o my_output_%j
## a per-process (soft) memory limit
## limit is specified in MB
## example: 1 GB is 1000
#SBATCH --mem-per-cpu=1000
## how long a job takes, wallclock time hh:mm:ss
#SBATCH -t 11:01:00
## the number of processes (number of cores)
#SBATCH -n 24
## parallel queue
#SBATCH -p parallel
## run my MPI executable
srun ./my_mpi_program
In MPI jobs, all the cores that the job uses must be of one type: either Sandy Bridge or Haswell. See the section 'Choosing between Sandy Bridge and Haswell nodes' below for details.
3.3.3 Interactive MPI-parallel jobs
The output executable program my_mpi_program can be run interactively with commands:
salloc -n 32 --ntasks-per-node=16 --mem-per-cpu=1000 -t 00:30:00 -p parallel
srun ./my_mpi_program
exit
-n number of processes (number of cores)
--ntasks-per-node number of processes per node. On Taito there are 16 cores (Sandy Bridge) or 24 cores (Haswell) per node. Setting this option to the number of cores in a node distributes the job so that the number of nodes is minimized.
-t running time, wallclock, format hh:mm:ss (hours:minutes:seconds)
--mem-per-cpu per process memory limit (MB)
Alternatively, as a one-liner:
salloc -n 32 --ntasks-per-node=16 --mem-per-cpu=1000 -t 00:30:00 -p parallel srun ./my_mpi_program
More generally, the --ntasks-per-node option can be used to control how the job is distributed over the nodes of the cluster.
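How -n and --ntasks-per-node together determine the node count can be estimated with integer arithmetic. The sketch below assumes that every node is filled up to the --ntasks-per-node value. The helper nodes_needed is hypothetical, and the actual placement is always decided by the batch queue system.

```shell
#!/bin/bash
# nodes_needed is a hypothetical helper, not a Slurm command.
# It prints the smallest number of nodes that can hold the tasks:
# the ceiling of (number of tasks) / (tasks per node).
nodes_needed() {
    local ntasks=$1 tasks_per_node=$2
    echo $(( (ntasks + tasks_per_node - 1) / tasks_per_node ))
}

echo "-n 32 --ntasks-per-node=16: $(nodes_needed 32 16) nodes"
echo "-n 32 --ntasks-per-node=8:  $(nodes_needed 32 8) nodes"
```

With 32 tasks, 16 tasks per node fit on two full Sandy Bridge nodes, whereas 8 tasks per node spread the same job over four nodes.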
3.3.4 Choosing between Sandy Bridge and Haswell nodes
In most cases the options discussed so far give the batch queue system enough information to decide how to place the tasks, and no further options are required. However, there are some exceptions.
3.3.4.1 Executables optimized for Haswell processors
Haswell processors can run code optimized for Sandy Bridge processors, but Sandy Bridge processors cannot run Haswell optimized executables. In most cases optimizing for Haswell processors does not give huge performance benefits over Sandy Bridge optimized executables, so in general it is easiest to optimize for Sandy Bridge processors. Some performance critical codes are built with Haswell specific optimization, in which case one needs to instruct the batch queue system to place the tasks on the Haswell nodes.
3.3.4.2 Parallel MPI programs
Parallel MPI programs should not be distributed on different kinds of nodes, for performance reasons. By default, the batch queue system places the MPI tasks on either Sandy Bridge or Haswell nodes, whichever are available first. This is usually the correct choice.
3.3.4.3 Parallel OpenMP programs
When a user specifies more than 16 threads per task, the batch queue system automatically places the task on Haswell nodes.
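The thread limits described in this chapter (16 cores per Sandy Bridge node, 24 per Haswell node) can be summarized as a small decision function. This is only an illustration of the rule, not how the batch queue system is implemented. The function name node_type_for_threads is hypothetical, and the printed labels reuse the node type names snb and hsw.

```shell
#!/bin/bash
# node_type_for_threads is a hypothetical helper that restates the rule above.
# Argument: number of threads per task (the --cpus-per-task value).
node_type_for_threads() {
    local threads=$1
    if [ "$threads" -gt 24 ]; then
        echo "none"          # does not fit in any single Taito node
    elif [ "$threads" -gt 16 ]; then
        echo "hsw"           # only Haswell nodes have enough cores
    else
        echo "snb or hsw"    # either node type can run the task
    fi
}

for t in 6 16 20 32; do
    echo "${t} threads: $(node_type_for_threads "$t")"
done
```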
3.3.4.4 Scalability testing and benchmarking
When running benchmarks, it is recommended to constrain the tasks to a single type of node and to reserve whole nodes, to minimize jitter in the timings.
3.3.4.5 Options that constrain the node type
The option to constrain the tasks to Sandy Bridge nodes is --constraint=snb, and the option for the Haswell nodes is --constraint=hsw. We advise using only these options, although more general constraints are possible.
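As an illustration, the lines below sketch how the header of an MPI batch job script might be extended for a benchmark run on Haswell nodes. The combination is an assumption rather than a tested recipe: --exclusive is the standard Slurm option for reserving whole nodes, but whether a given Taito partition accepts it is not confirmed here.

```shell
## hypothetical additions to the #SBATCH header of a batch job script
## run only on Haswell nodes
#SBATCH --constraint=hsw
## fill each 24-core Haswell node completely
#SBATCH --ntasks-per-node=24
## do not share the allocated nodes with other jobs
#SBATCH --exclusive
```

For Sandy Bridge nodes the corresponding values would be --constraint=snb and --ntasks-per-node=16.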