Example workflows

Identical jobs with different input files

Small files

Suppose you have a collection of small input files, e.g. input-0.txt, input-1.txt, input-2.txt, input-3.txt, and input-4.txt, and want to run the same command for each input file. One way of doing this is to use a workflow with a parameter sweep job factory.

A tarball can easily be created containing all the input files, e.g.

tar czvf inputs.tgz input-*.txt

and uploaded to object storage:

prominence upload --filename inputs.tgz --name inputs.tgz

A workflow can then be created in order to run the same job for each input file, for example:

{
   "name":"running-different-input-files",
   "jobs":[
      {
         "resources":{
            "nodes":1,
            "disk":10,
            "cpus":1,
            "memory":1
         },
         "name":"job",
         "artifacts":[
            {
               "url":"inputs.tgz"
            }
         ],
         "tasks":[
            {
               "image":"centos:7",
               "runtime":"singularity",
               "cmd":"/bin/bash -c \"cat input-${id}.txt > output-${id}.out\""
            }
         ],
         "outputFiles": [
             "output-${id}.out"
         ],
      }
   ],
   "factories":[
      {
         "type":"parameterSweep",
         "name":"sweep",
         "jobs":[
            "job"
         ],
         "parameters":[
            {
               "name":"id",
               "start":0,
               "end":4,
               "step":1
            }
         ]
      }
   ]
}
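
If the workflow definition above is saved in a file, e.g. workflow.json, it can be submitted using the CLI; the commands below assume the usual prominence run for submitting a JSON description and prominence list workflows for checking status:

prominence run workflow.json
prominence list workflows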

Here we simply cat each input file, but the same idea can easily be applied to more complex use cases. Since the inputs.tgz artifact is automatically unpacked in the job's working directory, each job can access its input file directly. We also specify a unique output file per job, with a filename which depends on the parameter value; note, however, that it is possible to use the same output filename for all jobs in a workflow if desired.

Large files

In the case of large input files it would not make sense to create a single tarball containing all of them. Instead, only the required input files should be provided to each job.

One method would be to upload each file to object storage, for example:

prominence upload --filename large-input-0.txt --name large-input-0.txt 
prominence upload --filename large-input-1.txt --name large-input-1.txt
prominence upload --filename large-input-2.txt --name large-input-2.txt
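
With a larger number of files the uploads can of course be scripted, for example with a simple shell loop over the filenames used above:

for i in 0 1 2; do
   prominence upload --filename large-input-$i.txt --name large-input-$i.txt
done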

Note that the names can be arbitrary and don’t need to contain an incrementing integer: in this example we make use of a zip job factory rather than a parameter sweep, which means we can specify an explicit list of filenames. For example:

{
   "name":"running-different-large-input-files",
   "jobs":[
      {
         "resources":{
            "nodes":1,
            "disk":10,
            "cpus":1,
            "memory":1
         },
         "name":"job",
         "artifacts":[
            {
               "url":"$filename"
            }
         ],
         "tasks":[
            {
               "image":"centos:7",
               "runtime":"singularity",
               "cmd":"cat $filename"
            }
         ]
      }
   ],
   "factories":[
      {
         "type":"zip",
         "name":"zip",
         "jobs":[
            "job"
         ],
         "parameters":[
            {
               "name":"filename",
               "values":[
                  "large-input-0.txt",
                  "large-input-1.txt",
                  "large-input-2.txt"
               ]
            }
         ]
      }
   ]
}
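
A zip factory pairs the values from its parameter lists element-wise, in the same way as Python's zip, so several lists of equal length can be combined when each job needs more than one parameter. As an illustrative sketch only (the tag parameter and its values are invented for this example), the parameters section above could become:

"parameters":[
   {
      "name":"filename",
      "values":[
         "large-input-0.txt",
         "large-input-1.txt",
         "large-input-2.txt"
      ]
   },
   {
      "name":"tag",
      "values":[
         "alpha",
         "beta",
         "gamma"
      ]
   }
]

giving three jobs, the first with filename set to large-input-0.txt and tag set to alpha, and so on.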