3.1 Constructing a batch job file

3.1.1 Batch Job Script Wizard

The most common way to use the SLURM batch job system is to first create a batch job file that is then submitted to the scheduler with the command sbatch. You can create batch job files with a normal text editor, or you can use the Batch Job Script Wizard tool in the Scientist's User Interface (https://sui.csc.fi/group/sui/batch-job-script-wizard) (see Figure 3.1). In the Batch Job Script Wizard, you first select the server you want to use and then fill in the settings for the batch job. The Batch Job Script Wizard cannot submit the job directly, but with the "Save Script" button you can save the batch job file directly to your home directory at CSC. After that you can use the My Files tool to further edit and launch the batch job (see Chapter 3.1.5).

Figure 3.1 Batch Job Script Wizard in the Scientist's User Interface (https://sui.csc.fi/group/sui/batch-job-script-wizard)

 

3.1.2 Structure of a batch job file

Below is an example of a SLURM batch job file made with a text editor:

#!/bin/bash -l
#SBATCH -J hello_SLURM
#SBATCH -o output.txt
#SBATCH -e errors.txt
#SBATCH -t 01:20:00
#SBATCH -p serial
#
echo "Hello SLURM"

The first line of the batch job file (#!/bin/bash -l) defines that the bash shell will be used. The following five lines contain information for the batch job scheduler. The syntax of the lines is:

#SBATCH -option argument


In the example above, five sbatch options are used: -J defines a name for the batch job (hello_SLURM in this case), -o defines the file name for the standard output and -e for the standard error, -t defines the maximum duration of the job (in this case 1 hour and 20 minutes), and -p defines that the job is sent to the serial partition. After the batch job definitions come the commands that will be executed. In this case there is just one command, echo "Hello SLURM", which prints the text "Hello SLURM" to the standard output.

The batch job file above can be submitted to the scheduler with the command:

sbatch file_name.sh 
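
If the submission succeeds, sbatch prints the ID number assigned to the job; for example (the job ID below is only illustrative):

Submitted batch job 123456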


The batch job file above includes only the most essential job definitions. However, it is often mandatory or useful to use other sbatch options too. The options needed to run parallel jobs are discussed in more detail in the following chapters. Table 3.1 contains some of the most commonly used sbatch options. The full list of sbatch options can be listed with the command:

sbatch -h

or

man sbatch

Table 3.1 Most commonly used sbatch options

Slurm option                  Description
--begin=time                  Defer job until HH:MM MM/DD/YY.
-c, --cpus-per-task=ncpus     Number of cpus required per task.
-C, --constraint=value        In Taito, the --constraint option can be used to select the processor type to be used (hsw = Haswell, snb = Sandy Bridge, "[hsw|snb]" = either snb or hsw, ssd = node with an SSD based temporary directory).
-d, --dependency=type:jobid   Defer job until a condition on jobid is satisfied.
-e, --error=err               File for the batch script's standard error.
--ntasks-per-node=n           Number of tasks per node.
-J, --job-name=jobname        Name of the job.
--mail-type=type              Notify on state change: BEGIN, END, FAIL or ALL.
--mail-user=user              E-mail address to which notifications about job state changes are sent.
-n, --ntasks=ntasks           Number of tasks to run.
-N, --nodes=N                 Number of nodes on which to run.
-o, --output=out              File for the batch script's standard output.
-t, --time=time               Time limit in format hh:mm:ss.
--mem=MB                      Maximum amount of real memory per node required by the job, in megabytes (recommended for serial jobs and shared memory parallel jobs).
--mem-per-cpu=MB              Maximum amount of real memory per allocated CPU, in megabytes (recommended for MPI parallel jobs).
-p, --partition=partition     Queue (partition) to be used. In Taito the available queues are: serial, parallel, longrun, test and hugemem.
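
Most of these options can also be given directly on the sbatch command line, in which case they override the corresponding #SBATCH definitions in the script. As an illustration only (the job ID 123456 and the script name postprocess.sh are hypothetical), the --dependency option can be used to start a post-processing job only after another job has finished successfully:

sbatch --dependency=afterok:123456 postprocess.sh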

In the second batch job example below, the options --mail-type and --mail-user are used to make the batch system send e-mail to the address kkayttaj@uni.fi when the job ends. Further, the job is defined to reserve 4 GB of memory. In the output and error file definitions, %j is used to include the job ID number in the file names, so that if the same batch job file is used several times, the old output and error files do not get overwritten.

#!/bin/bash -l
#SBATCH -J hello_SLURM
#SBATCH -o output_%j.txt
#SBATCH -e errors_%j.txt
#SBATCH -t 01:20:00
#SBATCH -n 1
#SBATCH -p serial
#SBATCH --mail-type=END
#SBATCH --mail-user=kkayttaj@uni.fi
#SBATCH --mem-per-cpu=4096
#

echo "Hello SLURM"
./my_command

 

3.1.3 Queues and resource requests

Setting optimal values for the requested computing time, memory and number of cores is not always a simple task. It is often useful to first run short test jobs to get a rough estimate of the computing time and memory requirements of the job. It is safer to reserve more computing time than needed, but note that jobs requesting a long computing time may, and often have to, wait longer in the queue than shorter jobs.
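
One way to check how much time and memory a completed test job actually used is the sacct command. A minimal example (the job ID 123456 below is only illustrative):

sacct -j 123456 --format=jobname,elapsed,timelimit,maxrss

Here the Elapsed column shows the actual run time against the requested time limit, and MaxRSS shows the peak memory use of each job step.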

All the batch queues have maximum durations and a maximum number of nodes that a job can use. You can check these limits with the command sinfo. For example:

sinfo -o "%10P %.5a %.10l  %.10s %.16F "
PARTITION  AVAIL  TIMELIMIT    JOB_SIZE   NODES(A/I/O/T)
serial*       up 3-00:00:00           1     768/98/1/867
parallel      up 3-00:00:00        1-28     768/98/1/867
longrun       up 14-00:00:0           1     767/95/1/863
test          up      30:00         1-2          1/3/0/4
hugemem       up 7-00:00:00           1          6/0/0/6

 

The sinfo output above tells that the cluster has five partitions (serial, parallel, longrun, test and hugemem). For example, the maximum execution time in the parallel queue is three days (3-00:00:00) and jobs can use up to 28 Haswell nodes (28 * 24 = 672 cores). Similarly, the maximum duration of jobs submitted to the test queue is 30 minutes (30:00).

The cluster partition you are using should match your requests for computing time, number of cores and memory. By default a job is submitted to the serial partition, where you can run serial jobs or parallel jobs that use up to 24 cores (one Haswell node) and require at most three days of run time. The maximum memory that can be reserved for a job in the serial partition is 256 GB. If your job's requests exceed these limits, you must use the option -p to choose a partition that meets the resource requests.

For example, a serial job that requires 6 days of computing time can be executed in the longrun partition:

#!/bin/bash -l
#SBATCH -J longrun_SLURM
#SBATCH -o output.txt
#SBATCH -e errors.txt
#SBATCH -t 6-00:00:00
#SBATCH -p longrun
#

./my_long_job

 

A small job that requires 1.0 TB of memory can be executed in the hugemem partition:

#!/bin/bash -l
#SBATCH -J hugemem_SLURM
#SBATCH -o output.txt
#SBATCH -e errors.txt
#SBATCH -t 06:00:00
#SBATCH -n 1
#SBATCH --mem-per-cpu=1000000
#SBATCH -p hugemem
#

./my_bigmemory_job

 

Estimating the memory request is even more difficult, as it depends on several things such as the algorithm, the software and the analysis task. In most cases 1-4 GB is enough, but for some applications you may need to request more memory.

The command sjstat can be used to check the available memory of the nodes in different partitions. The sjstat command lists the scheduling pool data and the running jobs; the scheduling pool data shows the available memory in the different partitions. You can list just the scheduling pool data by adding the option -c to the command:

sjstat -c

Scheduling pool data:
-------------------------------------------------------------
Pool        Memory  Cpus  Total Usable   Free  Other Traits  
-------------------------------------------------------------
serial*     64300Mb    16    450    450    118  snb,sandybridge
serial*    128600Mb    24    395    394      1  hsw,haswell
serial*    258000Mb    24     10     10      0  hsw,haswell
serial*    258000Mb    16     12     12      3  bigmem,snb,sandybridge
parallel    64300Mb    16    450    450    118  snb,sandybridge
parallel   128600Mb    24    395    394      1  hsw,haswell
parallel   258000Mb    24     10     10      0  hsw,haswell
parallel   258000Mb    16     12     12      3  bigmem,snb,sandybridge
longrun     64300Mb    16    450    450    118  snb,sandybridge
longrun    258000Mb    16      8      8      0  bigmem,snb,sandybridge
longrun    128600Mb    24    395    394      1  hsw,haswell
longrun    258000Mb    24     10     10      0  hsw,haswell
test        64300Mb    16      2      2      2  snb,sandybridge
test       128600Mb    24      2      2      1  hsw,haswell
hugemem   1551000Mb    40      4      4      0  bigmem,hsw,haswell,ssd
hugemem   1551000Mb    32      2      2      0  bigmem,snb,sandybridge

The sample listing above shows, for example, that the resource pool test contains 2 Sandy Bridge nodes, each having 64 GB of memory and 16 cores. In addition, the test pool also includes 2 Haswell nodes, each having 24 cores and 128 GB of memory.

Table 3.2 Available batch job queues in supercluster taito.csc.fi.

Queue             Maximum number of cores   Maximum run time   Maximum total memory
serial (default)  16 / 24 (one node*)       3 days             256 GB
parallel          448 / 672 (28 nodes*)     3 days             256 GB
longrun           16 / 24 (one node*)       14 days            256 GB
hugemem           40 (one node)             7 days             1.5 TB
test              32 / 48 (two nodes*)      30 min             64 GB
* Sandy Bridge / Haswell (a Sandy Bridge node has 16 cores and a Haswell node 24 cores)

 

3.1.4 Choosing between processor architectures

If a code is compiled with Haswell processor specific optimization parameters, it will not work on the Sandy Bridge processors. In these cases it is necessary to submit the job so that it uses only Haswell based nodes. This can be done by adding the following constraint parameter to the batch job file:

#SBATCH --constraint=hsw

 

Similarly, if you for some reason want to use only Sandy Bridge processors, you should use the constraint:

#SBATCH --constraint=snb

 

By default, the serial queue (jobs that fit inside one node, i.e., 1-16 cores for Sandy Bridge or 1-24 cores for Haswell nodes) will use resources from either architecture if they are available. This minimizes queueing time and maximizes resource usage. If you want to use only one of the architectures, use the constraint option as shown above.

In the case of parallel or test jobs, all the cores that the job uses must be of the same type, either Sandy Bridge or Haswell. By default, jobs submitted to the parallel or test queues may end up on either Sandy Bridge or Haswell nodes. Thus, if you want to be sure that Haswell processors are used for parallel computing, you must add #SBATCH --constraint=hsw to your batch script, e.g.:

#!/bin/bash -l
#SBATCH -J test_hsw
#SBATCH -o output.txt
#SBATCH -e errors.txt
#SBATCH -t 00:01:00
#SBATCH -p test
#SBATCH --constraint=hsw
#
./my_test_job

 

For MPI-only parallel jobs, the recommended way to reserve resources is simply to ask for cores, and SLURM will allocate as many nodes as needed. The number of nodes will depend on whether Sandy Bridge or Haswell nodes are used, but not specifying the number of nodes and tasks per node in advance minimizes the risk of human error and of underfilling the nodes.
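
As a minimal sketch of this approach (the job name, time limit, memory request and the executable name my_mpi_program are only placeholders), a 48-task MPI job can be requested simply with -n, letting SLURM decide how many nodes to allocate:

#!/bin/bash -l
#SBATCH -J mpi_job
#SBATCH -o output_%j.txt
#SBATCH -e errors_%j.txt
#SBATCH -t 02:00:00
#SBATCH -p parallel
#SBATCH -n 48
#SBATCH --mem-per-cpu=1000
#
# srun picks up the number of tasks from the batch allocation
srun ./my_mpi_program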

The architecture definition can be utilized for hugemem jobs too. By adding the definition --constraint=hsw to your batch job script you can ensure that the job runs on the newer Haswell based hugemem nodes, which have fast SSD based local temporary storage.
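
For example, combining the partition and constraint definitions in the batch job file directs the job to the Haswell hugemem nodes:

#SBATCH -p hugemem
#SBATCH --constraint=hsw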

 

3.1.5 Using Scientist's User Interface to execute batch jobs

The My Files tool in the Scientist's User Interface web portal (https://sui.csc.fi/group/sui/my-files) can be used to transfer and access data in CSC's storage systems (see Chapter 5.1 of the CSC computing environment user guide for details). In addition to data management, My Files allows users to submit batch jobs for execution. In My Files, select a computing host (for example, Taito) and browse to the directory in $WRKDIR where your job script is saved. Then select the job script file and right-click it. This opens a context menu showing the action "Submit Batch Job". Selecting this action sends your job script for execution.

Figure 3.2 Submitting job with My Files in Scientist's User Interface (https://sui.csc.fi/group/sui/my-files)

 
