Jobs with data and private images

Here we give an example of running a job which requires a private container image, input files and output files. We assume that the container image needs to be kept private and therefore cannot be put on Docker Hub.

The basic workflow is:

1. Upload the container image to object storage
2. Upload any required input data to object storage
3. Define and execute the job
4. Download the output data

The container image

Saving the image into a file

We will use the object storage integrated with PROMINENCE for storing the container image. The image will only be visible to the user who uploads it and to the user’s jobs which require it. Because the image will be stored in object storage rather than a container registry, it needs to be in the form of a single file. This can be either a Docker archive (.tar) or a Singularity Image Format file (.sif). Note that images in the Singularity Image Format are typically much smaller than Docker archives.

A Docker archive can be created using the Docker CLI, for example:

docker save centos:7 > centos7.tar

Alternatively an image built using Docker can be saved in the Singularity Image Format easily, for example:

singularity build centos7.sif docker-daemon://centos:7

Uploading the image

Using the PROMINENCE CLI the container image can be uploaded to object storage. For example:

prominence upload --filename centos7.sif --name centos7.sif

Here filename is the name of the file on your local system, and name is an arbitrary name which will be used later to reference the file.

Input data

Similarly, any private input data required by jobs can be uploaded to object storage using prominence upload. For example, suppose you have a file cadmesh-jet-v1.1.tgz required by jobs:

prominence upload --filename cadmesh-jet-v1.1.tgz --name cadmesh-jet-v1.1.tgz

Submitting the job

We now want to submit a job which runs a command inside the container image uploaded in the first step and accesses the input data. Here is a simple example:

prominence create --name test --artifact cadmesh-jet-v1.1.tgz centos7.sif "du -a ."

Note that we have used the container image name specified above (centos7.sif) and the name given to the input data (cadmesh-jet-v1.1.tgz). In this case the input file is a gzip-compressed tar archive. Before the user’s command is executed, this file will be decompressed and its contents extracted and made available in the current working directory of the job.
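Because the archive is unpacked into the job's working directory, the paths stored inside it decide where the files end up. The following is a minimal sketch for checking an archive's layout before uploading it. It builds a small stand-in archive (all file names here are hypothetical) and assumes that extraction on the PROMINENCE side preserves the relative paths stored in the archive, as tar normally does.

```shell
# Build a small stand-in archive in a scratch directory (file names are hypothetical).
workdir=$(mktemp -d)
mkdir -p "$workdir/src/mesh"
echo "example" > "$workdir/src/mesh/part1.dat"
echo "example" > "$workdir/src/README.txt"

# -C changes into src/ first, so the paths stored in the archive are relative to it.
tar -czf "$workdir/example-input.tgz" -C "$workdir/src" mesh README.txt

# List the archive members: these are the relative paths that extraction would create.
listing=$(tar -tzf "$workdir/example-input.tgz")
echo "$listing"
```

Running the same tar -tzf command on your own archive before uploading shows whether the files will land directly in the working directory or under a subdirectory.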

Small input data files or scripts

For the case of small input data files or scripts (< 1 MB in size) it is not necessary to upload the files to object storage. Instead they can be directly specified when the job is submitted. For example, suppose you have a file data.txt containing:

0 1 2 A
3 4 5 B

and a Python script test.py containing:

with open('data.txt', 'r') as input_file:
    print(input_file.readlines())

you can submit this directly in one line without any additional preparations:

prominence create --input data.txt --input test.py python:3 "python3 test.py"

This simple example also demonstrates that input files are placed into the current working directory inside the container.

It is of course possible for a job to use both small input files and larger files stored on object storage.
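As a rough aid for choosing between the two approaches, the sketch below checks a file's size against the 1 MB limit mentioned above. Treating 1 MB as 1,000,000 bytes is an assumption, and choose_method and the scratch file names are hypothetical helpers for illustration, not part of the PROMINENCE CLI.

```shell
# Threshold taken from the text above; treating 1 MB as 1,000,000 bytes is an assumption.
limit_bytes=1000000

# Print "input" if the file is under the limit (pass it with --input),
# otherwise "upload" (upload it to object storage first).
choose_method() {
    # Arithmetic expansion strips any padding that wc adds on some systems.
    size=$(( $(wc -c < "$1") ))
    if [ "$size" -lt "$limit_bytes" ]; then
        echo "input"
    else
        echo "upload"
    fi
}

# Demonstrate with two scratch files.
tmp=$(mktemp -d)
printf '0 1 2 A\n3 4 5 B\n' > "$tmp/data.txt"
head -c 2000000 /dev/zero > "$tmp/big.bin"

small_choice=$(choose_method "$tmp/data.txt")
big_choice=$(choose_method "$tmp/big.bin")
echo "data.txt: $small_choice"
echo "big.bin: $big_choice"
```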

Output data

If a job generates output data which is required by the user, either use the --output option to specify the name of an output file or use --outputdir to specify the name of an output directory. For the case of a directory, it will be automatically compressed into a tarball.

Here is a simple example with a single output file output.nc:

prominence create --name test \
                  --artifact cadmesh-jet-v1.1.tgz \
                  --output output.nc \
                  centos7.sif "touch output.nc"

For multiple output files use the --output option multiple times, e.g.:

prominence create --name test \
                  --artifact cadmesh-jet-v1.1.tgz \
                  --output output1.nc \
                  --output output2.nc \
                  centos7.sif "/bin/bash -c \"touch output1.nc ; touch output2.nc\""

Alternatively, if the job puts all output files into a single directory:

prominence create --name test \
                  --artifact cadmesh-jet-v1.1.tgz \
                  --outputdir output \
                  centos7.sif "/bin/bash -c \"mkdir output ; touch output/file1.txt ; touch output/file2.txt\""

In this example all output files are in the directory output.
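The escaped double quotes in the commands above can be hard to read. In a POSIX shell, wrapping the whole command in single quotes produces the same string, which may be easier to maintain. The sketch below only compares the two spellings locally and does not contact PROMINENCE; the variable names are illustrative.

```shell
# The command string as written in the examples above, with escaped inner quotes.
escaped="/bin/bash -c \"touch output1.nc ; touch output2.nc\""

# The same string written with single quotes on the outside.
single='/bin/bash -c "touch output1.nc ; touch output2.nc"'

# Both spellings yield an identical argument.
if [ "$escaped" = "$single" ]; then
    echo "identical"
fi
printf '%s\n' "$single"
```

The single-quoted form only works when the command itself contains no single quotes and needs no variable expansion by the local shell.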

Monitoring the job

prominence list can be used to list the status of all active jobs (i.e. idle or running). prominence describe can be used to get more information about a specific job.

The command prominence exec can be used to execute a command within a running job. Usage is of the form:

prominence exec <job id> <command>

Examples include checking what processes are running, listing files or looking at the content of files.

It’s also possible to download files from a running job using prominence snapshot. Usage is of the form:

prominence snapshot <job id> <file or directory>

A tarball is created of the file or directory, uploaded to object storage and downloaded to the machine where the CLI is run.

Note that for the case of multi-node jobs prominence exec and prominence snapshot use the first node.

Downloading output data

Once a job has completed, the prominence download command can be used to download any output files or directories associated with the job, for example:

prominence download <job id>
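The four workflow steps can be collected into one script. The sketch below uses only the commands shown on this page and prints them by default instead of running them. The run helper and the DRY_RUN and JOB_ID variables are hypothetical conveniences introduced for this example, not PROMINENCE features.

```shell
# Set DRY_RUN=0 to really run the commands; by default they are only printed.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it unchanged.
# Note: the printed form does not show the original quoting of arguments.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 1 and 2: upload the container image and the input data.
run prominence upload --filename centos7.sif --name centos7.sif
run prominence upload --filename cadmesh-jet-v1.1.tgz --name cadmesh-jet-v1.1.tgz

# Step 3: define and execute the job.
run prominence create --name test \
                      --artifact cadmesh-jet-v1.1.tgz \
                      --output output.nc \
                      centos7.sif "touch output.nc"

# Step 4: download the output once the job has completed.
# JOB_ID must be supplied by the user after the job has been created.
if [ -n "${JOB_ID:-}" ]; then
    run prominence download "$JOB_ID"
else
    echo "Set JOB_ID to the id of the completed job to download its output."
fi
```

In a real run the download step belongs in a separate invocation, since the job has to finish before its output is available.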