3.1 Using SLURM commands to execute batch jobs


The basic SLURM commands for submitting batch jobs are sbatch, which submits jobs to the batch job system, and scancel, which can be used to stop and remove a queued or running job. The basic syntax of the sbatch command is:

sbatch -options batch_job_file

Normally, sbatch options are included in the batch job file, but the options listed in Table 3.3 can also be used on the command line. For example:

sbatch -J test2 -t 00:05:00 batch_job_file.sh
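As an illustration, a minimal batch job file might look like the following sketch. The job name, time limit, output file names, and program name are placeholders, not values from this guide:

```shell
#!/bin/bash
#SBATCH -J test2              # job name (placeholder)
#SBATCH -t 00:05:00           # run time limit (hh:mm:ss)
#SBATCH -o output_%j.txt      # standard output file (%j expands to the job ID)
#SBATCH -e errors_%j.txt      # standard error file

# command(s) to run (placeholder program name)
srun ./my_program
```

Any of the #SBATCH directives above can be overridden on the sbatch command line, as shown in the example command.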

If the same option is used both on the command line and in the batch job file, the value defined on the command line overrides the value in the batch job file. When the job is successfully submitted, the command prints out a line with the ID number of the submitted job. For example:

Submitted batch job 6594

The job ID number can be used to monitor and control the job. For example, the job with ID 6594 could be cancelled with the command:

scancel 6594
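Because the job ID is needed for cancelling and monitoring, it can be convenient to capture it directly at submission time. A minimal sketch, assuming sbatch prints the standard "Submitted batch job <ID>" message shown above:

```shell
# Submit the job and extract the ID, which is the fourth word of
# sbatch's "Submitted batch job <ID>" message
jobid=$(sbatch batch_job_file.sh | awk '{print $4}')

# The captured ID can then be used to control or monitor the job, e.g.:
# scancel "$jobid"
# squeue -l -j "$jobid"
```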

The progress of submitted batch jobs can be followed with the commands squeue, apstat, sinfo, and sacct. These commands can also be used to check the status and parameters of the SLURM environment. By default, the squeue command lists all jobs that have been submitted to the scheduler. If you want to see the status of only your own jobs, you can use the command:

squeue -l -u username 
or
squeue -l -u $USER

You can also check the status of a specific job by giving its job ID with the -j option. The option -p partition displays only the jobs in a specific SLURM partition. The partitions of the system can be checked with the command sinfo, which shows information about SLURM nodes and partitions. sinfo shows, for example, which nodes are allocated and which are free:

[kkmattil@sisu-login5:~/.globus> sinfo -all
Tue Sep  9 14:58:49 2014
PARTITION AVAIL JOB_SIZE  TIMELIMIT   CPUS  S:C:T   NODES STATE      NODELIST
small     up    3-24       12:00:00     48 2:12:2       4 maint      nid0[1376-1379]
small     up    3-24       12:00:00     48 2:12:2       8 idle*      nid00[036-039,516-519]
small     up    3-24       12:00:00     48 2:12:2       1 down*      nid00933
small     up    3-24       12:00:00     48 2:12:2    1594 allocated  nid0[0016-0035,0040-0169,0200-0210,0254-0383,0392-0515,0520-0574,0584-0767,0772-0932,0934-0958,0960-1342,1344-1375,1380-1718]
small     up    3-24       12:00:00     48 2:12:2      64 idle       nid00[170-190,211-253]
large     up    24-400   3-00:00:00     48 2:12:2       4 maint      nid0[1376-1379]
large     up    24-400   3-00:00:00     48 2:12:2       8 idle*      nid00[036-039,516-519]
large     up    24-400   3-00:00:00     48 2:12:2       1 down*      nid00933
large     up    24-400   3-00:00:00     48 2:12:2    1594 allocated  nid0[0016-0035,0040-0169,0200-0210,0254-0383,0392-0515,0520-0574,0584-0767,0772-0932,0934-0958,0960-1342,1344-1375,1380-1718]
large     up    24-400   3-00:00:00     48 2:12:2      64 idle       nid00[170-190,211-253]
test_larg up    1-800       4:00:00     48 2:12:2       4 maint      nid0[1376-1379]
test_larg up    1-800       4:00:00     48 2:12:2       8 idle*      nid00[036-039,516-519]
test_larg up    1-800       4:00:00     48 2:12:2       1 down*      nid00933
test_larg up    1-800       4:00:00     48 2:12:2    1594 allocated  nid0[0016-0035,0040-0169,0200-0210,0254-0383,0392-0515,0520-0574,0584-0767,0772-0932,0934-0958,0960-1342,1344-1375,1380-1718]
test_larg up    1-800       4:00:00     48 2:12:2      64 idle       nid00[170-190,211-253]
gc        up    24-800   1-00:00:00     48 2:12:2       4 maint      nid0[1376-1379]
gc        up    24-800   1-00:00:00     48 2:12:2       8 idle*      nid00[036-039,516-519]
gc        up    24-800   1-00:00:00     48 2:12:2       1 down*      nid00933
gc        up    24-800   1-00:00:00     48 2:12:2    1594 allocated  nid0[0016-0035,0040-0169,0200-0210,0254-0383,0392-0515,0520-0574,0584-0767,0772-0932,0934-0958,0960-1342,1344-1375,1380-1718]
gc        up    24-800   1-00:00:00     48 2:12:2      64 idle       nid00[170-190,211-253]
test*     up    1-24          30:00     48 2:12:2       4 maint      nid0[1376-1379]
test*     up    1-24          30:00     48 2:12:2       8 idle*      nid00[036-039,516-519]
test*     up    1-24          30:00     48 2:12:2       1 down*      nid00933
test*     up    1-24          30:00     48 2:12:2    1594 allocated  nid0[0016-0035,0040-0169,0200-0210,0254-0383,0392-0515,0520-0574,0584-0767,0772-0932,0934-0958,0960-1342,1344-1375,1380-1718]
test*     up    1-24          30:00     48 2:12:2      72 idle       nid0[0170-0190,0211-0253,1719-1726]
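The -j and -p options of squeue, described above, can be used for example as follows (the job ID 6594 and the partition name small are taken from the examples in this section):

```shell
# Show the status of one specific job by its job ID
squeue -l -j 6594

# List only the jobs in the "small" partition
squeue -l -p small
```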

 

The command scontrol allows you to view the SLURM configuration and state. To check when a job waiting in the queue is estimated to start, use the command scontrol show job jobid. The row "StartTime=..." gives an estimate of the job's start time. It may happen that the start time cannot yet be estimated, in which case the value is "StartTime=Unknown". The estimated "StartTime" may change, i.e. move earlier, as time passes.
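For example, to pick out just the estimated start time of a queued job (6594 is the example job ID used earlier in this section):

```shell
# Show the full information record for the job, including StartTime
scontrol show job 6594

# Extract only the StartTime field from the output
scontrol show job 6594 | grep -o 'StartTime=[^ ]*'
```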


Table 3.2 Most frequently used SLURM commands.

Command   Description
sbatch    Submit a job script to the batch queue.
scancel   Signal or cancel jobs or job steps that are under the control of SLURM.
sinfo     View information about SLURM nodes and partitions.
squeue    View information about jobs in the SLURM scheduling queue.
smap      Graphically view information about SLURM jobs, partitions, and configuration parameters.
scontrol  View SLURM configuration and state.

 
