How to get information about your jobs

Suppose that you have submitted a job and would like some information about it. There are a variety of ways to access job status information.

Task 1: List all jobs

Before getting information about a specific job, it helps to see what is currently running and what has run previously.

Show every job in the queue (running or pending), along with its JOBID, the user who submitted it, etc.

Code: squeue

Sample output:

output1

Show a history of your jobs: completed, running, and failed

Code: sacct

Sample output:

output2

The JOBID is particularly useful. We will use it whenever we need information about a specific job. In this tutorial, whenever a command asks for <JOBID>, replace it with your actual JOBID as reported by squeue or sacct.
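As a minimal sketch of picking JOBIDs out of the listing, the hypothetical helper below (jobids_for_user is not a Slurm command) filters squeue-style text with awk. It assumes the default squeue column order, in which USER is the fourth field, and it runs here on lines copied from the sample run later in this tutorial rather than on a live queue.

```shell
#!/bin/bash
# Hypothetical helper: print the JOBID of every job owned by a given user.
# Reads squeue-style text on stdin and assumes the default column order
# (JOBID PARTITION NAME USER ST TIME NODES NODELIST), so USER is field 4.
jobids_for_user() {
    awk -v user="$1" 'NR > 1 && $4 == user { print $1 }'
}

# Demo input: lines copied from the sample run in this tutorial.
sample_squeue='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5178 normal nullRows sterba R 1:55:49 9 hpcc[03-11]
5198 normal C_hello ekrell R 0:07 1 hpcc13
5185 serial test1.sh tmerrick R 58:35 1 hpcc12'

# Print the JOBIDs that belong to user ekrell.
printf '%s\n' "$sample_squeue" | jobids_for_user ekrell
```

On a cluster the same helper could be fed from the real command, for example squeue | jobids_for_user "$USER". Long user names may appear truncated in the default listing, so running squeue -u "$USER" and letting squeue do the filtering is likely the more robust choice.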

Graphically display all the jobs, with a bit of info

This command will only work if your SSH access is set up so that graphical applications may be displayed over the network.

Code: sview

Sample output:

output3
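Before launching sview, it can help to check whether the shell has a display to draw on. The sketch below assumes an X11 setup in which SSH forwarding sets the DISPLAY variable; has_display is a hypothetical helper name, and a set DISPLAY is only a hint, not proof that forwarding works.

```shell
#!/bin/bash
# Hypothetical helper: succeed only when an X display is advertised to
# this shell. A non-empty DISPLAY is necessary for graphical programs,
# but it does not guarantee that forwarding actually works.
has_display() {
    [ -n "${DISPLAY:-}" ]
}

if has_display; then
    echo "DISPLAY is set to $DISPLAY; sview should be able to open a window"
else
    echo "DISPLAY is not set; reconnect with X forwarding (for example, ssh -X)"
fi
```

If the check fails, logging in again with ssh -X (or ssh -Y) is the usual way to request forwarding, provided the server permits it.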

Task 2: Display job information

Get detailed information about a running job with sstat

The sstat command gives a wealth of status information about a running job. However, it requires that the job's work be launched with the srun command. The following Slurm script runs a C program with mpirun, wrapped in srun. The C program would run just fine without srun, but if we leave out the srun portion of the command, sstat will not provide any information.

Sample slurm script: 

Code
#!/bin/bash
#SBATCH --job-name=C_hello
#SBATCH --output=slurm_c.out
#SBATCH --error=slurm_c.err
#SBATCH --partition=normal
#SBATCH -N 1
#SBATCH -t 04:30:00
#SBATCH -n 4
##SBATCH --cpus-per-task 4
srun mpirun ./hello
 
Now, do the following:
Step 1: run the script with sbatch to schedule the job:

Code: sbatch hello_c.slurm

Step 2: find the JOBID:

Code: squeue

Step 3: use that JOBID to get info with sstat:

Code: sstat <JOBID>

Sample Run:
[ekrell@hpcm ekrell]$ sbatch hello_c.slurm 
Submitted batch job 5198
[ekrell@hpcm ekrell]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5107 cbirdq full cbird R 7-14:27:17 1 hpcc26
5166 cbirdq MaNib cbird R 1-01:04:41 1 hpcc25
5142 normal BiNib cbird R 3-03:52:47 1 hpcc01
5168 normal OaNib cbird R 1-00:56:18 1 hpcc02
5178 normal nullRows sterba R 1:55:49 9 hpcc[03-11]
5198 normal C_hello ekrell R 0:07 1 hpcc13
5185 serial test1.sh tmerrick R 58:35 1 hpcc12
[ekrell@hpcm ekrell]$ sstat 5198
JobID MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AveCPUFreq ReqCPUFreq ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- -------------- ------------ --------------- --------------- ------------ ------------ ---------------- ---------------- ------------
sstat: WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
5198.0 1171496K hpcc13 0 1171496K 46688K hpcc13 3 46678K 26K hpcc13 3 23K 00:00.000 hpcc13 0 00:00.000 4 2.80G 0 1M hpcc13 0 1M 0.13M hpcc13 0 0.13M
[ekrell@hpcm ekrell]$
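
The three steps above can be chained so that the JOBID does not have to be copied by hand. The following is a minimal sketch, assuming sbatch prints its confirmation in the "Submitted batch job <JOBID>" form seen in the sample run; parse_sbatch_jobid is a hypothetical helper, and the demo only echoes the resulting sstat command instead of contacting a scheduler.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric JOBID from sbatch's
# confirmation line, e.g. "Submitted batch job 5198" -> "5198".
parse_sbatch_jobid() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Demo with the confirmation line from the sample run above.
jobid=$(echo "Submitted batch job 5198" | parse_sbatch_jobid)

# Show the command that would be run next (no scheduler is contacted here).
echo "sstat $jobid"
```

On a cluster, the demo line would become jobid=$(sbatch hello_c.slurm | parse_sbatch_jobid), followed by sstat "$jobid". Because the full sstat table is very wide, a narrower view such as sstat --format=JobID,MaxRSS,AveCPU -j "$jobid" may be easier to read. sbatch also accepts --parsable, which prints the JOBID in a script-friendly form and makes the awk step unnecessary.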