3.2 Using SLURM commands to execute batch jobs
The basic SLURM commands for handling batch jobs are sbatch, which submits jobs to the batch job system, and scancel, which stops and removes a queued or running job. The basic syntax of the sbatch command is:
sbatch -options batch_job_file
Normally the sbatch options are included in the batch job file, but you can also give the options listed in Table 3.1 on the command line. For example:
sbatch -J test2 -t 00:05:00 batch_job_file.sh
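For orientation, a batch job file that carries its own sbatch options might look like the following minimal sketch. The job name test1, the ten-minute time limit and the job_banner helper are illustrative assumptions, not values from this guide. The #SBATCH lines are read by sbatch but are ordinary comments to bash, so the file also runs outside the batch job system.

```shell
#!/bin/bash
# Options for sbatch: job name (-J) and run time limit (-t, hh:mm:ss).
# Both values are placeholders chosen for this sketch.
#SBATCH -J test1
#SBATCH -t 00:10:00

# Hypothetical helper: print the job ID that SLURM exports to the job,
# or a fallback text when the script is run outside SLURM.
job_banner() {
    echo "Job ${SLURM_JOB_ID:-not-in-slurm} started"
}

job_banner
```

Submitting this file with the command line shown above would run the job under the name test2 with a five-minute limit instead of the values written in the file.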
If the same option is given both on the command line and in the batch job file, the value given on the command line overrides the value in the batch job file. When the job is successfully submitted, the command prints a line that gives the ID number of the submitted job. For example:
Submitted batch job 6594
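In scripts it is often useful to store the job ID in a variable. The sketch below assumes that sbatch prints exactly the line format shown above. The function name parse_job_id is a hypothetical helper, and the example feeds it the sample line instead of calling sbatch.

```shell
# Extract the numeric job ID from the line that sbatch prints on success.
# $1: a line such as "Submitted batch job 6594"
parse_job_id() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Example with the sample line from the text (no SLURM needed):
jobid=$(parse_job_id "Submitted batch job 6594")
echo "$jobid"
```

In a real session the call could be written as jobid=$(parse_job_id "$(sbatch batch_job_file.sh)"). Slurm versions that support the sbatch option --parsable can print the bare job ID directly, which may make the helper unnecessary.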
The job ID number can be used to follow the progress of the job or to remove it. For example, a job with ID 6594 can be removed from the batch job system with the command:
scancel 6594
To prevent the batch job system from overloading, the number of jobs that a single user can have in the batch job system of Taito at one time is limited to 896.
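A rough way to see how close you are to this limit is to count the lines that squeue prints for your user name. The sketch below assumes one output line per job, which may not hold for every job type, for example array jobs. The names MAX_JOBS and slots_left are hypothetical, and the example input is three placeholder lines rather than real squeue output.

```shell
# Job limit quoted in the text for Taito.
MAX_JOBS=896

# Read one line per job on stdin and print how many more jobs
# could still be submitted under the limit.
slots_left() {
    awk -v max="$MAX_JOBS" 'END { print max - NR }'
}

# Stand-in input: three placeholder lines instead of real squeue output.
printf 'job-a\njob-b\njob-c\n' | slots_left
```

On the cluster the input could come from squeue -h -u $USER, where -h suppresses the header line.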
The progress of submitted batch jobs can be followed with the commands squeue, sjstat and sacct. These commands can also be used to check the status and parameters of the batch job environment. Usage examples for squeue and sacct are given below.
squeue -l -u username
or
squeue -l -u $USER
You can also check the status of a specific job by giving the job ID with the -j option. The option -p partition displays only the jobs in that SLURM partition.
The scontrol command shows the SLURM configuration and state. To check when a job waiting in the queue will be executed, use the command scontrol show job jobid. The row "StartTime=..." gives an estimate of the job start time. If the start time cannot be estimated, the row reads "StartTime=Unknown". Note that the "StartTime" estimate may change, for example move earlier, as time passes.
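If only the start time estimate is of interest, it can be picked out of the scontrol output with a small filter. This is a sketch that assumes the field appears as StartTime=value followed by a space or the end of the line. The function name start_time is hypothetical and the input lines in the example are made up for illustration.

```shell
# Print the value of the first StartTime field found on stdin.
start_time() {
    sed -n 's/.*StartTime=\([^ ]*\).*/\1/p' | head -n 1
}

# Illustrative input lines, not copied from a real job:
printf '   JobState=PENDING Reason=Priority\n   StartTime=Unknown EndTime=Unknown\n' | start_time
printf '   StartTime=2015-02-01T12:00:00 EndTime=2015-02-01T12:05:00\n' | start_time
```

On the cluster the filter could be used as scontrol show job 6594 | start_time.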
The sacct command can be used to study the log file of the batch job system, so it can show information about both active jobs and jobs that have already finished. By default the sacct command shows information about the user's own jobs. The sacct command has a wide selection of options and parameters that can be used to select the data to be displayed. By default sacct displays information from the period that starts at midnight of the current day. You can change the starting date with the option -S YYYY-MM-DD. For example, to list the information since the first of February 2015, you can use the command:
sacct -S 2015-02-01
Information about specific jobs can be checked with the option -j job-ID. For example, detailed information about job number 6594 could be shown with the command:
sacct -S 2013-02-01 -j 6594 -l
Quite often the full listing of the job information is not desirable. To choose only specific information, you can use option -o combined with the list of fields to display. For example:
[kkayttaj@taito-login4~]$ sacct -j 6594 -o MaxRSS,AveRSS,ReqMem,Elapsed,AllocCPUS
    MaxRSS     AveRSS     ReqMem    Elapsed AllocCPUS
---------- ---------- ---------- ---------- ---------
                          2347Mc   02:01:49         4
  3480116K   3480116K     2347Mc   02:01:49         1
In the example above, the listing shows that job 6594 used 3.5 GB (3480116 KB) of memory and lasted 2 hours, 1 minute and 49 seconds. This information could then be used to optimize batch job parameters for other similar jobs.
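The conversion behind the 3.5 GB figure can be reproduced with a one-line helper. The sketch assumes decimal units (1 GB = 1 000 000 KB), which matches the rounding used in the text. SLURM itself may count in multiples of 1024, in which case the result would be slightly smaller. The function name kb_to_gb is hypothetical.

```shell
# Convert a sacct memory value such as "3480116K" to gigabytes,
# assuming decimal units (1 GB = 1 000 000 KB).
kb_to_gb() {
    awk -v kb="${1%K}" 'BEGIN { printf "%.1f\n", kb / 1000000 }'
}

kb_to_gb 3480116K   # prints 3.5
```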
When a batch job has finished, it is good practice to run the seff command to check the efficiency of the job. The syntax of the seff command is:
seff jobid
The sample session below shows a case where a job (job ID 54321) took 49 minutes and 19 seconds and used the reserved CPU resources efficiently (98.68% efficiency). As for memory, nearly 40 GB was reserved but only a little over 4 GB was used at most. For a second, similar job, the user should therefore consider decreasing the memory reservation.
[kkayttaj@taito-login4~] seff 54321
Job ID: 54321
Cluster: csc
User/Group: kayttaj/somegroup
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:48:40
CPU Efficiency: 98.68% of 00:49:19 core-walltime
Memory Utilized: 4.06 GB
Memory Efficiency: 10.39% of 39.06 GB
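The two efficiency figures in the seff report are plain ratios of used to reserved resources, and they can be recomputed from the other values in the listing. The sketch below does this with awk. The function names hms_to_seconds and percent are hypothetical helpers, not part of seff.

```shell
# Convert a hh:mm:ss time to seconds. awk is used because shell arithmetic
# would treat values with leading zeros, such as 08, as invalid octal numbers.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print used/reserved as a percentage with two decimals.
percent() {
    awk -v used="$1" -v reserved="$2" 'BEGIN { printf "%.2f\n", 100 * used / reserved }'
}

# CPU efficiency: CPU time used divided by the core-walltime.
percent "$(hms_to_seconds 00:48:40)" "$(hms_to_seconds 00:49:19)"   # prints 98.68

# Memory efficiency: memory used divided by memory reserved (both in GB).
percent 4.06 39.06                                                  # prints 10.39
```

Here 00:48:40 is 2920 seconds and 00:49:19 is 2959 seconds, so the CPU efficiency is 2920 / 2959, or about 98.68%.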
Table 3.1 Most frequently used SLURM commands.
|sacct||Displays accounting data for all jobs.|
|salloc||Allocate resources for interactive use.|
|sbatch||Submit a job script to a queue.|
|scancel||Signal jobs or job steps that are under the control of SLURM (cancel jobs or job steps).|
|scontrol||View SLURM configuration and state.|
|seff||View the CPU and memory efficiency of a job (real usage compared to the reserved resources).|
|sinfo||View information about SLURM nodes and partitions.|
|sjstat||Display statistics of jobs under control of SLURM (combines data from sinfo, squeue and scontrol).|
|smap||Graphically view information about SLURM jobs and partitions, and set configuration parameters.|
|squeue||View information about jobs located in the SLURM scheduling queue.|
|srun||Run a parallel job.|