Job status
Listing jobs
The list command by default lists any active jobs, i.e. jobs which are idle or running:
prominence list
output
ID STATUS IMAGE CMD ARGS
3101 idle alahiff/testpi
3103 idle alahiff/cherab-jet:latest python batch_make_sensitivity_matrix.py 0 59
3104 idle ikester/blender:latest blender -b classroom/classroom.blend -o frame_### -f 1
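If the listing needs to be processed by a script, the table can be split into fields. The following is a minimal sketch in Python, assuming the columns are separated by whitespace and that only the trailing ARGS column may contain spaces. The function name parse_job_list is hypothetical and not part of PROMINENCE.

```python
def parse_job_list(text):
    """Parse the table printed by `prominence list` into a list of dicts.

    Assumption: columns are whitespace-separated and only ARGS contains spaces.
    Rows without a command simply lack the "cmd" and "args" keys.
    """
    columns = ["id", "status", "image", "cmd", "args"]
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    jobs = []
    for line in lines[1:]:  # skip the header row
        fields = line.split(None, len(columns) - 1)
        job = dict(zip(columns, fields))
        job["id"] = int(job["id"])
        jobs.append(job)
    return jobs


sample = """ID STATUS IMAGE CMD ARGS
3101 idle alahiff/testpi
3103 idle alahiff/cherab-jet:latest python batch_make_sensitivity_matrix.py 0 59
"""

for job in parse_job_list(sample):
    print(job["id"], job["status"], job.get("cmd", "(none)"))
```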
It’s also possible to request a list of jobs using a constraint on the labels associated with each job. For example, if you submitted a group of jobs with a label name=run5, the following would list all such jobs:
prominence list --all --constraint name=run5
Here the --all option means that both active (i.e. idle or running) and completed jobs will be listed.
Describing a job
To get more information about an individual job, use the describe command, for example:
prominence describe 345
output
[
  {
    "id": 345,
    "status": "created",
    "resources": {
      "cpus": 1,
      "disk": 10,
      "memory": 1,
      "nodes": 1,
      "walltime": 720
    },
    "tasks": [
      {
        "image": "alahiff/testpi",
        "runtime": "singularity"
      }
    ],
    "events": {
      "createTime": "2019-06-18T10:16:36"
    }
  }
]
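Because the describe output is a JSON array, it can be consumed by any JSON parser. The following is a minimal sketch in Python, assuming the output has already been captured as a string. The function name summarize_jobs and the choice of fields are illustrative only.

```python
import json

# Output of `prominence describe 345`, copied from the example above.
describe_output = """
[
  {
    "id": 345,
    "status": "created",
    "resources": {"cpus": 1, "disk": 10, "memory": 1, "nodes": 1, "walltime": 720},
    "tasks": [{"image": "alahiff/testpi", "runtime": "singularity"}],
    "events": {"createTime": "2019-06-18T10:16:36"}
  }
]
"""


def summarize_jobs(text):
    """Return a compact summary for each job in a describe JSON array."""
    jobs = json.loads(text)
    return [
        {
            "id": job["id"],
            "status": job["status"],
            "cpus": job["resources"]["cpus"],
            "images": [task["image"] for task in job["tasks"]],
        }
        for job in jobs
    ]


print(summarize_jobs(describe_output))
```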
To show information about completed jobs, both the list and describe commands accept a --completed option. For example, to list the last 2 completed jobs:
prominence list --completed --last 2
output
ID STATUS IMAGE CMD ARGS
2980 completed alahiff/tensorflow:1.11.0 python models-1.11/official/mnist/mnist.py --export_dir mnist_saved_model
2982 completed alahiff/tensorflow:1.11.0 python models-1.11/official/mnist/mnist.py --export_dir mnist_saved_model
Note that jobs which are completed or have been removed for some reason may be visible briefly without using the --completed option.
Completed jobs
The JSON descriptions of completed jobs contain additional information. This may include:
- status: current job status (idle, running, completed, failed, deleted, or killed).
- statusReason: for jobs in a terminal state other than the completed state this may give a reason for the current status.
- createTime: date & time when the job was created by the user.
- startTime: date & time when the job started running.
- endTime: date & time when the job ended.
- site: the site where the job was executed.
- maxMemoryUsageKB: the maximum total memory usage of the job, summed over all processes (note that this is not available for jobs running on remote HTC or HPC resources).
- retries: the number of job retries attempted.
- provisionedResources: the number of CPU cores, memory (in GB), disk (in GB) and number of nodes provisioned for the job.
- cpu: details of the CPU used to run the job (clock, model, and vendor).
- runtimeVersion: the versions of singularity and udocker (in the form <version>/<tarball_release>) used.
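When handling the status field in a script, it can help to group the values. The following is a minimal sketch in Python: the active group (idle, running) follows the definition given for the list command, while treating the four remaining listed values as terminal is an assumption. The function name classify_status is hypothetical. Any value not in the list above, such as the created status shown in the describe example, is reported as unknown.

```python
ACTIVE_STATUSES = {"idle", "running"}
# Assumption: the remaining documented statuses are all terminal states.
TERMINAL_STATUSES = {"completed", "failed", "deleted", "killed"}


def classify_status(status):
    """Group a job status string as 'active', 'terminal' or 'unknown'."""
    if status in ACTIVE_STATUSES:
        return "active"
    if status in TERMINAL_STATUSES:
        return "terminal"
    return "unknown"


for status in ("idle", "failed", "created"):
    print(status, "->", classify_status(status))
```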
The following information is also provided for each task:
- retries: the number of task retries attempted.
- exitCode: the exit code returned by the user’s job. This would usually be 0 for success.
- imagePullTime: time taken to pull the container image. If a cached image from a previous task was used this will be -1.
- imagePullStatus: image pull status, e.g. completed.
- imageSha256: SHA256 sum of the container image.
- wallTimeUsage: wall time used by the task.
- cpuTimeUsage: CPU time used by the task. For a task that keeps multiple CPUs busy this can be larger than the wall time.
- maxResidentSetSizeKB: maximum resident set size (in KB) of the largest process.
- stageInTime: total time to stage-in any files.
- stageOutTime: total time to stage-out any files.
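The per-task fields can be combined into a few derived indicators. The following is a minimal sketch in Python, assuming a task is available as a dictionary with the field names listed above. The function name summarize_task and the derived keys are hypothetical, and the ratio is only meaningful if both times are reported in the same unit.

```python
def summarize_task(task):
    """Derive simple indicators from one task's execution record."""
    wall = task.get("wallTimeUsage")
    cpu = task.get("cpuTimeUsage")
    return {
        # An exit code of 0 usually indicates success.
        "succeeded": task.get("exitCode") == 0,
        # An imagePullTime of -1 means a cached image was used.
        "image_cached": task.get("imagePullTime") == -1,
        # CPU time divided by wall time; above 1 when several CPUs are busy.
        "cpu_to_wall_ratio": cpu / wall if cpu is not None and wall else None,
    }


task = {
    "exitCode": 0,
    "retries": 0,
    "imagePullTime": 1.424,
    "wallTimeUsage": 0.3726,
    "cpuTimeUsage": 0.3326,
}
print(summarize_task(task))
```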
For example:
{
  "id": 61,
  "status": "completed",
  "resources": {
    "cpus": 1,
    "disk": 10,
    "memory": 1,
    "nodes": 1
  },
  "tasks": [
    {
      "image": "eoscprominence/testpi",
      "runtime": "singularity"
    }
  ],
  "events": {
    "createTime": "2022-04-28 20:03:06",
    "startTime": "2022-04-28 20:04:06",
    "endTime": "2022-04-28 20:04:12"
  },
  "execution": {
    "site": "SLURM-CSD3",
    "provisionedResources": {
      "cpus": 1,
      "disk": 154,
      "memory": 1,
      "nodes": 1
    },
    "cpu": {
      "clock": "3039.367",
      "model": "Intel(R) Xeon(R) Platinum 8276 CPU @ 2.20GHz",
      "vendor": "GenuineIntel"
    },
    "runtimeVersion": {
      "singularity": "3.8.7-1.el7"
    },
    "tasks": [
      {
        "exitCode": 0,
        "retries": 0,
        "imagePullTime": 1.424,
        "imageSha256": "c694b456f242e484753a83107e47ca29d5a08e784c45f4225b1758713ad2e236",
        "wallTimeUsage": 0.3726,
        "cpuTimeUsage": 0.3326,
        "maxResidentSetSizeKB": 38384
      }
    ]
  }
}
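The timestamps in the events block make it possible to work out how long a job waited before starting and how long it ran. The following is a minimal sketch in Python, assuming the timestamps are in an ISO-like format and all refer to the same time zone. The function name job_timings is hypothetical.

```python
from datetime import datetime

# The "events" block copied from the example above.
events = {
    "createTime": "2022-04-28 20:03:06",
    "startTime": "2022-04-28 20:04:06",
    "endTime": "2022-04-28 20:04:12",
}


def job_timings(events):
    """Return the waiting time and running time of a job in seconds."""
    created = datetime.fromisoformat(events["createTime"])
    started = datetime.fromisoformat(events["startTime"])
    ended = datetime.fromisoformat(events["endTime"])
    return {
        "wait_seconds": (started - created).total_seconds(),
        "run_seconds": (ended - started).total_seconds(),
    }


print(job_timings(events))  # {'wait_seconds': 60.0, 'run_seconds': 6.0}
```

For this job the result is a 60 second wait between creation and start, followed by 6 seconds of execution.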