This lesson is being piloted (Beta version)

Job queueing and control in SuperPOD

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • How to run control Job in SuperPOD

Objectives
  • To teach command to work with Job in SLURM

The SuperPOD cluster uses the Simple Linux Utility for Resource Management system (SLURM) to manage jobs.

5b. Job Queue and Control

In SLURM there are several usefull commands for checking your job:

Lifecycle of a Job

The life of a job begins when you submit the job to the scheduler. If accepted, it will enter the Queued state.

Thereafter, the job may move to other states, as defined below:

image

Useful Commands

Here are some basic SLURM commands for submitting, querying and deleting jobs in SuperPOD:

Command Actions
srun -N1 -G1 --pty $SHELL Submit an interactive job (reserves 1 Node, 1GPU, 1CPU, 6gb RAM, 1 hour walltime)
sbatch job.sh submit the job script job.sh
sstat <job id> Check the status of the job given jobID
sstat <job id> --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID Narrow some information on sstat
squeue -u <username> Check the status of all jobs submitted by given username
scontrol show job <job id> Check the detailed information for job with given job ID
scancel <job id> Delete the queued or running job given job ID

Check pending, working job:

$ squeue -u $USERNAME

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON
12345  workshop     bash     tuev  R      39:46      1 bcm-dgxa100-0002

The above Job has a JOBID=12345, which will be used below:

Check configuration of any requested job using JOBID:

$ scontrol show job 12345 grep ReqTRES

ReqTRES=cpu=5,mem=30G,node=1,billing=5,gres/gpu=1

Delete any job

$ scancel 12345

Checking how your job is running in node

When you know your working node, for example bcm-dgxa100-0001, from login node, you can login to the compute node and check the processing:

$ ssh bcm-dgxa100-0001
$ top -u $USERNAME
$ ssh bcm-dgxa100-0001
$ nvidia-smi
OR to refresh the command every 0.2s
$ watch -n .2 nvidia-smi

Key Points

  • Job queue, control