3.2 Constructing a batch job file

The most common way to use the SLURM batch job system is to first create a batch job file, which is then submitted to the scheduler with the command sbatch. You can create batch job files with a normal text editor, or you can use the Batch Job Script Wizard tool in the Scientist's User Interface web portal (https://sui.csc.fi/group/sui/batch-job-script-wizard). You can also submit your job scripts in the Scientist's User Interface by using its My Files tool (https://sui.csc.fi/group/sui/my-files): select the batch job file, right-click on it and choose Submit Batch Job.

In the case of Sisu, many job-specific parameters are defined only with the aprun job submission command and not as SLURM batch job definitions. You need to define at least two SLURM parameters:

  • The partition to be used (test, test_large, small, small_long, large or gc)
  • The amount of computing resources (i.e. nodes)

Normally it is also useful to define the computing time. The amount of computing resources to be used can be specified in two alternative ways:

1. You can reserve a certain number of nodes (each having 24 computing cores) with the SLURM option -N.

2. Alternatively, you can define the total number of cores to be used with the option -n and then the distribution of the cores over the nodes with the option --ntasks-per-node.

We recommend that you use full computing nodes for running jobs if possible. In this case reserving the resources by defining the number of nodes with -N is often more convenient.
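
As a sketch of the two alternatives above, the following header lines request the same amount of resources. The value of 8 nodes is only an illustrative assumption (it matches the example later in this section). The second alternative is written with a doubled # so that the scheduler treats it as an ordinary comment and only one of the two definitions is active.

```shell
#!/bin/bash -l
# Alternative 1: reserve 8 full nodes (8 x 24 = 192 cores).
#SBATCH -N 8

# Alternative 2 (inactive here because of the doubled #):
# request 192 cores in total and place 24 of them on each node.
##SBATCH -n 192
##SBATCH --ntasks-per-node=24
```

To switch to the second alternative, remove one # from its two lines and add one to the -N line.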

The minimum size of a parallel job is 3 nodes (72 cores). By default, the maximum size of a job is 42 nodes (1008 cores). If a user wishes to submit larger jobs (up to 400 nodes), the parallel performance of the software needs to be demonstrated first with a scalability test.
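
The arithmetic behind these limits can be sketched with a small bash function. The function name check_nodes and the variable names are hypothetical and not part of SLURM or the Sisu environment. The function only checks a node count against the default limits quoted above and prints the corresponding number of cores.

```shell
#!/bin/bash
# Hypothetical helper: values are taken from the limits quoted in the text.
CORES_PER_NODE=24   # each Sisu node has 24 computing cores
MIN_NODES=3         # minimum size of a parallel job
MAX_NODES=42        # default maximum size of a job

check_nodes() {
    local nodes=$1
    if (( nodes < MIN_NODES )); then
        echo "too small: at least $MIN_NODES nodes are required"
        return 1
    elif (( nodes > MAX_NODES )); then
        echo "too large: more than $MAX_NODES nodes requires a scalability test"
        return 1
    fi
    # Within the default limits: print the total number of cores.
    echo "$(( nodes * CORES_PER_NODE ))"
}

check_nodes 8
```

For 8 nodes the function prints 192, which is the core count used in the example batch job of this section.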



Below is an example of a SLURM batch job file for Sisu:

#!/bin/bash -l
#SBATCH -J test_job
#SBATCH -o test_job%J.out
#SBATCH -e test_job%J.err
#SBATCH -t 05:30:00
#SBATCH -N 8
#SBATCH -p small

(( ncores = SLURM_NNODES * 24 ))
echo "Running namd with $SLURM_NNODES nodes containing total of $ncores cores"
module load namd
aprun -n $ncores namd2 namd.run

The first line of the batch job file (#!/bin/bash -l) defines that the bash shell will be used. The flag -l makes bash act as if it had been invoked as a login shell, which makes it possible, for example, to call the module command within the script. The following six lines contain information for the batch job scheduler. The syntax of the lines is:

#SBATCH -sbatch_option argument

In the example above we use six sbatch options:

-J defines a name for the batch job (test_job in this case)
-o defines the file name for the standard output
-e defines the file name for the standard error
-t defines the maximum duration of the job, in this case 5 hours and 30 minutes
-N defines that the job will use 8 nodes (containing a total of 8 x 24 = 192 computing cores)
-p defines the partition (queue) the job will be sent to, in this case small

In the output and error file definitions, the notation %J is used to include the job ID number in the file name, so that if the same batch job file is used several times, the old output and error files will not be overwritten.

After the batch job definitions, one inserts the commands that will be executed. In the example above, the script calculates the number of cores to be used ($ncores), so that changes in the number of nodes are automatically taken into account in the aprun command. Finally, the command module load namd sets up the namd molecular dynamics application, and the aprun command launches the actual namd job. The job can be submitted to the batch job system with the command:

sbatch file_name.sh
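
When a job is accepted, sbatch normally prints a line of the form "Submitted batch job" followed by the job ID. As a minimal sketch, assuming this default message format, the job ID could be picked out in a script as follows. The function name job_id_from_sbatch_output is hypothetical, and the job ID 123456 is only a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from the line that sbatch
# prints on success, assumed here to look like "Submitted batch job 123456".
job_id_from_sbatch_output() {
    local line=$1
    # The job ID is the last space-separated word of the line.
    echo "${line##* }"
}

# Usage on a system with SLURM (not executed in this sketch):
#   jobid=$(job_id_from_sbatch_output "$(sbatch file_name.sh)")
#   echo "Submitted job $jobid"

# Demonstration with a placeholder message:
job_id_from_sbatch_output "Submitted batch job 123456"
```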

The batch job file above includes only the most essential job definitions. However, it is often mandatory or useful to use several other sbatch options too. The options needed to run parallel jobs are discussed in more detail in the following chapters. Table 3.3 contains some of the most commonly used sbatch options. The full list of sbatch options can be printed with the command:

sbatch -h

 

Table 3.3 Commonly used sbatch options applicable on the Sisu supercomputer.

Slurm option                   Description
--begin=time                   Defer job until HH:MM MM/DD/YY.
-d, --dependency=type:jobid    Defer job until condition on jobid is satisfied.
-e, --error=err                File for batch script's standard error.
-J, --job-name=jobname         Name of job.
--mail-type=type               Notify on state change: BEGIN, END, FAIL or ALL.
--mail-user=user               Who to send email notification for job state changes.
-N, --nodes=N                  Number of nodes on which to run.
-o, --output=out               File for batch script's standard output.
-t, --time=time                Time limit, e.g. in format hh:mm:ss.
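
As an illustration of how the options in Table 3.3 can be combined, the header of the earlier example could be extended as sketched below. The start time, the job ID 123456 in the dependency and the e-mail address are placeholder assumptions and must be replaced with real values. The dependency type afterok makes the job wait until the given job has finished successfully.

```shell
#!/bin/bash -l
#SBATCH -J test_job
#SBATCH -o test_job%J.out
#SBATCH -e test_job%J.err
#SBATCH -t 05:30:00
#SBATCH -N 8
#SBATCH -p small
# Do not start the job before 16:00 (placeholder time).
#SBATCH --begin=16:00
# Start only after job 123456 (placeholder ID) has completed successfully.
#SBATCH --dependency=afterok:123456
# Send an e-mail to the placeholder address when the job ends.
#SBATCH --mail-type=END
#SBATCH --mail-user=user@example.com
```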

 

List of batch job examples in this guide:
