3.1 Constructing a batch job file
The most common way to use the SLURM batch job system is to first create a batch job file and then submit it to the scheduler with the sbatch command. You can create batch job files with a normal text editor, or you can use the Batch Job Script Wizard tool in the Scientist's User Interface (https://sui.csc.fi/group/sui/batch-job-script-wizard, see Figure 3.1). In the Batch Job Script Wizard, you first select the server you want to use and then fill in the settings for the batch job. The wizard cannot submit the job directly, but with "Save Script" you can save the batch job file directly to your home directory at CSC. After that you can use the My Files tool to further edit and launch the batch job (see section 3.1.5).
Figure 3.1 Batch Job Script Wizard in the Scientist's User Interface (https://sui.csc.fi/group/sui/batch-job-script-wizard)
Below is an example of a SLURM batch job file made with a text editor:
#!/bin/bash -l
#SBATCH -J hello_SLURM
#SBATCH -o output.txt
#SBATCH -e errors.t
#SBATCH -t 01:20:00
#SBATCH -p serial
#
echo "Hello SLURM"
The first line of the batch job file (#!/bin/bash -l) defines that the bash shell will be used. The following five lines contain information for the batch job scheduler. The syntax of the lines is:
#SBATCH -sbatch_option argument
In the example above we use five sbatch options: -J defines a name for the batch job (hello_SLURM in this case), -o defines the file name for the standard output and -e for the standard error. -t defines the maximum duration of the job, in this case 1 hour and 20 minutes. -p defines that the job is sent to the serial partition. After the batch job definitions come the commands that will be executed. In this case there is just one command, echo "Hello SLURM", which prints the text "Hello SLURM" to the standard output.
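As a minimal sketch, the same job definitions can also be written with the long option names of sbatch (--job-name, --output, --error, --time and --partition). The file name hello_long.sh is only an illustrative assumption. Because the #SBATCH lines start with #, the shell treats them as comments, so the commands in the file can be tried out locally before the file is submitted:

```shell
# Write the batch job file (hello_long.sh is a hypothetical file name).
cat > hello_long.sh <<'EOF'
#!/bin/bash -l
#SBATCH --job-name=hello_SLURM
#SBATCH --output=output.txt
#SBATCH --error=errors.txt
#SBATCH --time=01:20:00
#SBATCH --partition=serial
#
echo "Hello SLURM"
EOF

# Run the commands locally; the #SBATCH lines are plain comments to the shell.
sh hello_long.sh

# Submitting to the scheduler requires SLURM, so it is left commented out here:
# sbatch hello_long.sh
```

Running the file locally only checks the commands; the resource requests take effect only when the file is submitted with sbatch.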
The batch job file above can be submitted to the scheduler with the sbatch command, giving the name of the batch job file as the argument:
sbatch batch_job_file
The batch job file above includes only the most essential job definitions. However, it is often mandatory or useful to use other sbatch options too. The options needed to run parallel jobs are discussed in more detail in the following chapters. Table 3.1 contains some of the most commonly used sbatch options. The full list of sbatch options can be listed with command:
sbatch --help
Table 3.1 Most commonly used sbatch options
|--begin=time||Defer job until HH:MM MM/DD/YY.|
|-c, --cpus-per-task=ncpus||Number of cpus required per task.|
|-C, --constraint=value||In Taito, the --constraint option can be used to select the processor type to be used (hsw = Haswell, snb = Sandy Bridge, "[hsw|snb]" = either snb or hsw, ssd = node with SSD-based temporary directory).|
|-d, --dependency=type:jobid||Defer job until condition on jobid is satisfied.|
|-e, --error=err||File for batch script's standard error.|
|--ntasks-per-node=n||Number of tasks per node.|
|-J, --job-name=jobname||Name of the job.|
|--mail-type=type||Notify on state change: BEGIN, END, FAIL or ALL.|
|--mail-user=user||Address to which e-mail notifications of job state changes are sent.|
|-n, --ntasks=ntasks||Number of tasks to run.|
|-N, --nodes=N||Number of nodes on which to run.|
|-o, --output=out||File for batch script's standard output.|
|-t, --time=time||Time limit, for example in format hh:mm:ss or d-hh:mm:ss.|
|--mem=MB||Maximum amount of real memory per node required by the job in megabytes. (Recommended for serial jobs and shared memory parallel jobs)|
|--mem-per-cpu=MB||Maximum amount of real memory per allocated CPU required by the job in megabytes. (Recommended for MPI parallel jobs)|
|-p, --partition=partition||Specify queue (partition) to be used. In Taito the available queues are: serial, parallel, longrun, test and hugemem.|
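The --dependency option in Table 3.1 needs the job ID of an earlier job. The sketch below assumes that sbatch reports a successful submission with a line of the form "Submitted batch job <jobid>" and that the dependency type afterok (start only after the earlier job has finished successfully) is wanted. The helper function and the script names first_step.sh and second_step.sh are hypothetical.

```shell
# Extract the job ID (the last word) from the message printed by sbatch.
job_id_from_sbatch_output() {
    echo "$1" | awk '{ print $NF }'
}

# Example with a message of the assumed form:
job_id_from_sbatch_output "Submitted batch job 123456"

# On a system with SLURM the two jobs could be chained like this:
# first=$(sbatch first_step.sh)
# sbatch --dependency=afterok:"$(job_id_from_sbatch_output "$first")" second_step.sh
```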
In the second batch job example below, the options --mail-type and --mail-user are used to make the batch system send e-mail to the address firstname.lastname@example.org when the job ends. In addition, the job is defined to reserve 4 GB of memory. In the output and error file definitions, %j is replaced by the job ID number in the file name, so that if the same batch job file is used several times, the old output and error files will not be overwritten.
#!/bin/bash -l
#SBATCH -J hello_SLURM
#SBATCH -o output_%j.txt
#SBATCH -e errors_%j.txt
#SBATCH -t 01:20:00
#SBATCH -n 1
#SBATCH -p serial
#SBATCH --mail-type=END
#SBATCH --mail-user=firstname.lastname@example.org
#SBATCH --mem-per-cpu=4096
#
echo "Hello SLURM"
./my_command
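To illustrate the effect of %j, the sketch below mimics the substitution that the batch system performs on the file name pattern. The function expand_job_id and the job IDs are made up for the example; in a real job SLURM does the replacement itself.

```shell
# Replace every %j in a file name pattern with the given job ID.
expand_job_id() {
    pattern=$1
    job_id=$2
    printf '%s\n' "$pattern" | sed "s/%j/${job_id}/g"
}

expand_job_id "output_%j.txt" 123456   # prints output_123456.txt
expand_job_id "errors_%j.txt" 123457   # prints errors_123457.txt
```

Two submissions of the same batch job file therefore write to differently named files.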
3.1.3 Queues and resource requests
Setting optimal values for the requested computing time, memory and number of cores is not always simple. It is often useful to first send short test jobs to get a rough estimate of the computing time and memory requirements of the job. It is safer to reserve more computing time than needed, but remember that jobs with large computing time requests often have to wait longer in the queue than shorter jobs.
All the batch queues have a maximum duration and a maximum number of nodes that a job can use. You can check these limits with the sinfo command. For example:
sinfo -o "%10P %.5a %.10l %.10s %.16F "
PARTITION  AVAIL  TIMELIMIT   JOB_SIZE   NODES(A/I/O/T)
serial*       up 3-00:00:00          1     768/98/1/867
parallel      up 3-00:00:00       1-28     768/98/1/867
longrun       up 14-00:00:0          1     767/95/1/863
test          up      30:00        1-2          1/3/0/4
hugemem       up 7-00:00:00          1          6/0/0/6
The sinfo output above shows that the cluster has five partitions (serial, parallel, longrun, test and hugemem). For example, the maximum execution time in the parallel queue is three days (3-00:00:00) and jobs can use up to 28 Haswell nodes (28 * 24 = 672 cores). Similarly, the maximum duration of jobs submitted to the test queue is 30 minutes (30:00).
The cluster partition you are using should match the reservations for computing time, number of cores and memory. By default a job is submitted to the serial partition, where you can run serial jobs or parallel jobs that use up to 24 cores (one Haswell node) and require at most three days of run time. The maximum memory that can be reserved for a job in the serial partition is 256 GB. If the resource requests of your job exceed these limits, you must use option -p to choose a partition that meets the resource requests.
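The time limits in the sinfo listing can be compared more easily when they are converted to seconds. The following is only a sketch: the function name slurm_time_to_seconds is hypothetical, and it handles just the forms that appear in this section (mm:ss, hh:mm:ss and d-hh:mm:ss), not every time format that SLURM accepts.

```shell
# Convert a time limit of the form mm:ss, hh:mm:ss or d-hh:mm:ss to seconds.
# Other time formats accepted by SLURM are not handled in this sketch.
slurm_time_to_seconds() {
    echo "$1" | awk '{
        days = 0
        t = $0
        if (index(t, "-") > 0) {        # split off the day count
            split(t, p, "-")
            days = p[1]
            t = p[2]
        }
        n = split(t, f, ":")
        if (n == 3) {
            secs = f[1] * 3600 + f[2] * 60 + f[3]
        } else {
            secs = f[1] * 60 + f[2]     # mm:ss
        }
        print days * 86400 + secs
    }'
}

slurm_time_to_seconds 3-00:00:00   # parallel queue limit: 259200 (3 days)
slurm_time_to_seconds 30:00        # test queue limit: 1800 (30 minutes)
slurm_time_to_seconds 01:20:00     # 4800 (1 hour 20 minutes)
```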
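For example, the batch job file below requests six days of run time, which exceeds the three-day limit of the serial partition, so the job is sent to the longrun partition:
#!/bin/bash -l
#SBATCH -J longrun_SLURM
#SBATCH -o output.txt
#SBATCH -e errors.txt
#SBATCH -t 6-00:00:00
#SBATCH -p longrun
#
./my_long_job
Similarly, a job that needs more memory than the 256 GB available in the serial partition can be sent to the hugemem partition. The batch job file below reserves 1000000 MB (about 1 TB) of memory for a six-hour job:
#!/bin/bash -l
#SBATCH -J longrun_SLURM
#SBATCH -o output.txt
#SBATCH -e errors.txt
#SBATCH -t 06:00:00
#SBATCH -n 1
#SBATCH --mem-per-cpu=1000000
#SBATCH -p hugemem
#
./my_bigmemory_job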
Estimating the memory request is even more difficult, as it depends on several things such as the algorithm, the software and the analysis task. In most cases 1-4 GB is enough, but some applications need more memory.
The sjstat command lists the scheduling pool data and the running jobs. The scheduling pool data can be used to check the available memory of the nodes in different partitions. You can list just the scheduling pool data by adding option -c to the command:
sjstat -c
Scheduling pool data:
-------------------------------------------------------------
Pool          Memory  Cpus  Total Usable   Free  Other Traits
-------------------------------------------------------------
serial*      64300Mb    16    450    450    118  snb,sandybridge
serial*     128600Mb    24    395    394      1  hsw,haswell
serial*     258000Mb    24     10     10      0  hsw,haswell
serial*     258000Mb    16     12     12      3  bigmem,snb,sandybridge
parallel     64300Mb    16    450    450    118  snb,sandybridge
parallel    128600Mb    24    395    394      1  hsw,haswell
parallel    258000Mb    24     10     10      0  hsw,haswell
parallel    258000Mb    16     12     12      3  bigmem,snb,sandybridge
longrun      64300Mb    16    450    450    118  snb,sandybridge
longrun     258000Mb    16      8      8      0  bigmem,snb,sandybridge
longrun     128600Mb    24    395    394      1  hsw,haswell
longrun     258000Mb    24     10     10      0  hsw,haswell
test         64300Mb    16      2      2      2  snb,sandybridge
test        128600Mb    24      2      2      1  hsw,haswell
hugemem    1551000Mb    40      4      4      0  bigmem,hsw,haswell,ssd
hugemem    1551000Mb    32      2      2      0  bigmem,snb,sandybridge
The sample listing above shows, for example, that the resource pool test contains 2 Sandy Bridge nodes, each having 64 GB of memory and 16 cores. In addition, the test pool also includes 2 Haswell nodes, each having 24 cores and 128 GB of memory.
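When the listing is long, the free nodes of each pool can be summed with a small filter. The sketch below assumes that the Free column is the sixth field and counts nodes, and that data rows can be recognised by a memory field ending in "Mb". The function names are hypothetical and the sample rows are copied from the listing above.

```shell
# Sum the Free column of "sjstat -c" style output for each pool.
# Data rows are recognised by a second field ending in "Mb".
free_nodes_per_pool() {
    awk '$2 ~ /Mb$/ {
        pool = $1
        sub(/\*$/, "", pool)            # drop the default-partition marker
        free[pool] += $6
    }
    END {
        for (pool in free) print pool, free[pool]
    }' | sort
}

# A few rows copied from the sample listing.
sample_sjstat_output() {
    cat <<'EOF'
Pool          Memory  Cpus  Total Usable   Free  Other Traits
serial*      64300Mb    16    450    450    118  snb,sandybridge
serial*     128600Mb    24    395    394      1  hsw,haswell
serial*     258000Mb    24     10     10      0  hsw,haswell
serial*     258000Mb    16     12     12      3  bigmem,snb,sandybridge
test         64300Mb    16      2      2      2  snb,sandybridge
test        128600Mb    24      2      2      1  hsw,haswell
hugemem    1551000Mb    40      4      4      0  bigmem,hsw,haswell,ssd
hugemem    1551000Mb    32      2      2      0  bigmem,snb,sandybridge
EOF
}

sample_sjstat_output | free_nodes_per_pool
# With live data the filter would be used as: sjstat -c | free_nodes_per_pool
```

For the sample rows this prints hugemem 0, serial 122 and test 3, one pool per line.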
Table 3.2 Available batch job queues in supercluster taito.csc.fi.
|Queue||Maximum number of cores||Maximum run time||Maximum total memory|
|serial (default)||16 / 24 (one node*)||3 days||256 GB|
|parallel||448 / 672 (28 nodes*)||3 days||256 GB|
|longrun||16 / 24 (one node*)||14 days||256 GB|
|hugemem||40 (one node)||7 days||1.5 TB|
|test||32 / 48 (two nodes*)||30 min||64 GB|
3.1.4 Choosing between processor architectures
If a code is compiled with Haswell processor specific optimization parameters, it will not work on the Sandy Bridge processors. In these cases it is necessary to submit the job so that it will use only Haswell based nodes. This can be specified by adding the following constraint parameter to the batch job file:
#SBATCH --constraint=hsw
Similarly, if you for some reason want to use only Sandy Bridge processors, you should use the constraint:
#SBATCH --constraint=snb
By default the serial queue (jobs that fit inside one node, i.e., 1-16 cores for Sandy Bridge or 1-24 cores for Haswell nodes) will use resources from either architecture if they are available. This will minimize queueing and maximise resource usage. If you want to use only one of the architectures, use the constraint option as shown above.
In the case of parallel or test jobs, all the cores that the job uses must be of the same type, either Sandy Bridge or Haswell. By default, jobs submitted to the parallel or test queues may use either architecture. Thus, if you want to use Haswell processors for parallel computing, you must add #SBATCH --constraint=hsw to your batch script, for example:
#!/bin/bash -l
#SBATCH -J test_hsw
#SBATCH -o output.txt
#SBATCH -e errors.txt
#SBATCH -t 00:01:00
#SBATCH -p test
#SBATCH --constraint=hsw
#
./my_test_job
For MPI-only parallel jobs, the recommended way to reserve resources is simply to ask for cores. SLURM will allocate as many nodes as needed. The number of nodes will depend on whether Sandy Bridge or Haswell nodes are used, but not specifying the number of nodes and tasks per node in advance minimizes the risk of human error and of underfilling the nodes.
The architecture definition can be used for hugemem jobs too. By adding the definition --constraint=hsw to your batch job script, you can ensure that the job runs on the newer Haswell based hugemem nodes, which have fast SSD based local temporary storage.
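As a rough illustration of how the node count follows from the core request, the sketch below computes the smallest number of nodes that can hold a given number of cores. The function name nodes_needed and the 48-core request (for example #SBATCH -n 48) are assumptions made for the example; the actual allocation is decided by SLURM.

```shell
# Smallest number of nodes that can hold the requested cores.
nodes_needed() {
    cores=$1
    cores_per_node=$2
    echo $(( (cores + cores_per_node - 1) / cores_per_node ))
}

nodes_needed 48 16   # Sandy Bridge nodes (16 cores each): 3
nodes_needed 48 24   # Haswell nodes (24 cores each): 2
```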
3.1.5 Using Scientist's User Interface to execute batch jobs
The My Files tool in the Scientist's User Interface web portal (https://sui.csc.fi/group/sui/my-files) can be used to transfer and access data in CSC's storage systems (see Chapter 5.1 of CSC computing environment user guide for details). In addition to data management, My Files allows users to submit batch jobs for execution. In My Files, select the computing host (for example, Taito) and then browse to the directory in $WRKDIR where your job script is saved. Then right-click the job script file. This opens a context menu showing the action "Submit Batch Job". Selecting this action sends your job script for computation.
Figure 3.2 Submitting job with My Files in Scientist's User Interface (https://sui.csc.fi/group/sui/my-files)