DistribJob – A Distributed Job Controller

Overview

DistribJob is a distributed job controller: it distributes elements of an input set to nodes in a compute cluster for processing, and merges the results.

 

More specifically, the controller runs on a server.  It breaks a large input set into smaller sets called subtasks, and assigns the subtasks to compute slots on nodes (machines) as they become available.  On the node, it runs a command on the subtask’s input, and merges the result into the main result on the server.  It robustly tracks failures and can be restarted.

 

DistribJob does not do scheduling or load balancing.  For a particular job, it is given a static set of nodes, each with a static number of slots.  When a slot is vacated, DistribJob fills it with the next subtask.

 

DistribJob is written in Perl, using Perl objects and modules.

 

Caveat:  DistribJob is academic software.  It has been extensively used and tested, and is robust.  However, its installation, documentation and usage have not been brought to the level of industry polish.  For example, some of the steps below forgo thorough handholding, expecting you instead to open a Perl file here or a test file there.  None of the steps are hard.  Just rough-cut.

Download

The DistribJob software is available for download from the GUS SVN repository.  You will also need the CBIL libraries, which are available from the same repository.

Installation

Use the GUS Installer to install the DistribJob and CBIL software.

DistribJob comes in two modules: the DistribJob module itself, which is the core, and the optional DistribJobTasks module, which includes a set of predefined bioinformatics tasks (“tasks” are defined below).  The latter are useful as samples, even if the particular tasks included are not relevant to your application.

 

To install either one, acquire the distribjob package (either by downloading it or by checking it out of SVN).  Then follow the instructions in the DistribJob/readme.txt or DistribJobTasks/readme.txt file.  Installing DistribJobTasks will also install DistribJob (but not vice versa).

Testing your installation

DistribJob and DistribJobTasks include a set of tests.  You are advised to run these before proceeding.

Basic local test

Once you have completed the installation, the first test is:

   % distribjob $HOME/test/DistribJob/input/controller_local.prop 1 2

 

This test should work right out of the box if you installed your test/ directory into $HOME.  Otherwise, edit test/DistribJob/input/controller_local.prop and change the values of the properties that refer to files in the test/ directory so that they reflect the actual location of test/.
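If several properties point into test/, a one-line sed rewrite can save hand-editing.  Here is a sketch on a scratch copy; the property names (inputdir, masterdir) and all paths are made up for illustration, so adapt them to what you actually see in controller_local.prop:

```shell
# Scratch demo: rewrite path-valued properties to the real location of test/
mkdir -p /tmp/djdemo
printf 'inputdir=/home/me/test/DistribJob/input\nmasterdir=/home/me/test/master\n' > /tmp/djdemo/controller_local.prop
# point every /home/me/test reference at the actual test/ location
sed -i 's|/home/me/test|/data/me/test|g' /tmp/djdemo/controller_local.prop
cat /tmp/djdemo/controller_local.prop
```

Run the same sed command against your real controller_local.prop once you are sure the substitution is right.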

 

The test should produce output that looks like this:

 

[sfischer@pythia ~]$ distribjob test/DistribJob/input/controller_local.prop 1 2

Reading properties from test/DistribJob/input/controller_local.prop

Reading properties from /home/sfischer/test/DistribJob/input/task.prop

Finding input set size

Input set size is 30

Subtask set size is 3 (10 subtasks)

Initializing server...

 

Initializing node 1...

Initializing node 2...

.

dispatching subTask 1 to node 1.1

 

dispatching subTask 2 to node 2.1

.

subTask 2 failed

 

dispatching subTask 3 to node 2.1

......

subTask 1 succeeded

 

dispatching subTask 4 to node 1.1

 

subTask 3 succeeded

 

 

subTask 9 succeeded

Cleaning up nodes...

 

Failure: 2 subtasks failed

Please look in /home/sfischer/test/master/failures/*/result

After analyzing and correcting failures:

  1. mv /home/sfischer/test/master/failures /home/sfischer/test/master/failures.save

  2. set restart=yes in test/DistribJob/input/controller_local.prop

  3. restart the job

 

The test purposely includes two subtasks that fail.  This gives you the chance to practice correcting failures and restarting.  The source of the problem is that two of the lines in test/DistribJob/input/inputset (which is the input for this job) have “oops” on them.  To correct the problem, edit those two lines to remove the “oops” (do not delete the lines themselves, which would change the input indexing), then follow the three restart steps printed at the end of the test output.
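The fix can be sketched like this, shown on a scratch copy of the input.  The file contents and the replacement text here are invented for the demo; the real inputset lines will differ:

```shell
# Scratch demo: repair the bad lines in place without changing the line count
mkdir -p /tmp/djdemo
printf 'item1\noops\nitem3\noops\nitem5\n' > /tmp/djdemo/inputset
# edit the bad lines rather than deleting them, so subtask indexing is preserved
sed -i 's/^oops$/item_fixed/' /tmp/djdemo/inputset
cat /tmp/djdemo/inputset
```

The important point is that the line count stays the same before and after the fix.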

Basic liniac test

This test is exclusively for UPenn’s Liniac.  To run it:

% cd $HOME/test

% liniacsubmit 2 2 $HOME/test/DistribJob/input/controller_liniac.prop

The immediate output is the message confirming entry to the queue.

Once the queue runs the job, its output (identical to that from the local test) will appear in $HOME/test/xxxxx, where xxxxx is a log file whose name starts with the queue’s job id.

DistribJobTasks tests

The bioinformatics tasks included in the DistribJobTasks module also have tests.  Run these in a similar fashion to the basic tests.  You may need to edit the task.prop files to specify where your bioinformatics resources are.

Specifying your task

You provide a task to DistribJob (see configuration below).  The task determines:

 

Choose from the built-in tasks provided by the DistribJobTasks module, or code your own task.

Coding your own task (advanced)

To code your own task, you will need to write a subclass of DistribJob::Task.  For samples, see:

 

You will also need to define the command that will run your subtasks on the nodes.  The command must:

Constraints on input

Following are constraints that apply to your task’s input:

Using built-in nodes

DistribJob distributes the subtasks to nodes (i.e., machines in a cluster).  You will specify how many compute slots each node has when you configure DistribJob (discussed below).  The DistribJob package includes a growing number of built-in types of node, including:

The LocalNode can be run on any multi-processor machine, or even a single-processor machine where it is efficient to have more than one subtask running at a time.

Coding your own node (guru)

If your cluster uses a process control system other than one of these, you can still use DistribJob, but you will need to write some simple code.  DistribJob::Node is the object that represents a node.  Its main purpose is to specify how the server and node communicate.  The details of particular types of nodes are specified by subclasses of DistribJob::Node, which is what you will need to write.  To learn how, use DistribJob::BprocNode and DistribJob::LocalNode as samples.

 

You may also want to help your users by providing a cluster-specific startup script.  This script submits a job to your cluster’s queue, and then calls distribjob (see Running below) when the job is ready to run.  As a sample, see DistribJob/bin/liniacsubmit, which submits a job to UPenn’s Liniac cluster.  Also see DistribJob/bin/liniacjob, the script that the queue runs; it in turn calls distribjob.

Setting up to run

The first step is to decide where you want your input directory, and where you want your master directory.  Typically, these will go in a directory dedicated to this run of your task.  Just make sure there is enough storage space to accommodate your results.

As an example, let’s say you chose to run in $HOME/myrun, and that your task creates a BLAST matrix (one of the provided bioinformatics tasks).  Create your input directory, and copy the input file to it (in this case, a set of DNA sequences).

% mkdir -p $HOME/myrun/input

% cp myseqs.fsa $HOME/myrun/input

Configuring

Create two configuration files for your task (the standard place for them is your input directory): controller.prop, which configures the controller itself, and task.prop, which configures your particular task.
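As an illustration only, the two files might look roughly like this.  Apart from nodeClass and restart, which are mentioned elsewhere in this document, every property name and value below is an assumption; check the files in test/DistribJob/input/ and your task’s documentation for the real names:

```
# controller.prop (property names other than nodeClass and restart are hypothetical)
masterdir=/home/me/myrun/master
inputdir=/home/me/myrun/input
nodeClass=DistribJob::LocalNode
restart=no

# task.prop (property names are hypothetical)
taskClass=DistribJobTasks::BlastMatrixTask
inputFilePath=/home/me/myrun/input/myseqs.fsa
```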

Other resources

You may need to make other resources available to your task.  For example, the provided BlastMatrixTask and BlastSimilarityTask make use of a database of sequences to BLAST against.  This file may be very large, and so you will not want to copy it into your input directory.  Instead, specify the resource’s location using the appropriate property in your task.prop file.

Running

The distribjob command starts the controller.  Running it with no arguments prints its usage.  Running it with -help prints its full help display.

 

You use different commands to run locally or to run on different clusters.

 

Regardless of how you run, the result of all the subtasks will be merged into master/mainresult.

Running locally

If you use DistribJob::LocalNode as your node type, you are running locally (i.e., on your local server).  You do not need to submit your job to a cluster queue; you just run distribjob directly.  To do so, use this command (if, for example, you want to distribute across 3 virtual nodes):

% distribjob your_controller.prop 1 2 3

Running on UPenn’s Liniac cluster

You are assumed to know how to use UPenn’s Liniac cluster.  To run distribjob on the Liniac, set the nodeClass property in your controller.prop file to DistribJob::SgeNode.  Rather than calling distribjob directly, submit your distributed job to the Liniac’s queue by calling liniacsubmit.  Here is its usage:

% liniacsubmit nodecount minutes controllerPropFileFullPath

 

liniacsubmit produces as immediate output the standard queue submission report.  For example, if you want to run on 20 nodes and estimate that your job will take 10 hours (600 minutes), use this:

% cd where_I_want_my_log/

% liniacsubmit 20 600 /my_inputdir/controller.prop

 

When the queue runs your job, it will place the job’s log in the directory where you ran liniacsubmit.  Check this log to see your job’s progress.

 

To see the status of your job on the queue, run:

% showq

Running on a different type of cluster

If you plan on running on a cluster type other than UPenn’s Liniac, you or your administrator will need to provide commands that parallel liniacsubmit and liniacjob.

Handling problems

These are the kinds of problems you may encounter:

False starts

These are errors in which the job fails immediately, and no work is dispatched to the nodes.  The most common cause is an error in a configuration file.  The log should explain the problem, although the explanation might be less than obvious.

 

Once you have corrected the problem, you need to delete your master directory, and start again.

Failures

When a subtask running on a node fails, the log will report the failure, and the subtask’s files are copied to a directory under master/failures/subtask_nnn/result.  Look carefully at all the files there to determine the cause of the failure.

 

If the problem appears to be one that will happen reproducibly, then you need to correct its source.  For example, you may need to correct your input file (but don’t delete or add elements; this will mess up the indexing used by the controller).  Or, you may need to provide missing data or executable files.
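A quick way to scan every failed subtask’s result directory for error text is a recursive grep.  This is sketched on a scratch directory; your failure files and their wording will differ:

```shell
# Scratch demo: list failure files that mention an error
mkdir -p /tmp/djdemo/master/failures/subtask_2/result
echo 'FATAL: bad input on line 4' > /tmp/djdemo/master/failures/subtask_2/result/log
# -r recurse, -i case-insensitive, -l print only matching file names
grep -ril 'fatal\|error' /tmp/djdemo/master/failures
```

Against a real run you would point the grep at your own master/failures directory.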

 

If the problem seems like a random flux of the cosmos, then you can defer correcting the source of the problem.

 

After you have handled all the failures, and when your job is no longer running, delete the master/failures/ directory and restart the job (see below).

Hung subtasks

The directory master/running contains a subdirectory for each running subtask.  Hung subtasks will have subdirectories there whose subtask numbers should have long since come and gone.
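One way to spot them is by modification time.  Here is a sketch on a scratch directory; the two-hour threshold is arbitrary, so pick one suited to your subtasks’ expected runtimes:

```shell
# Scratch demo: flag running-subtask directories untouched for over two hours
mkdir -p /tmp/djdemo/master/running/subtask_7 /tmp/djdemo/master/running/subtask_9
touch -d '3 hours ago' /tmp/djdemo/master/running/subtask_7   # simulate a stale subtask
# -mmin +120 matches directories last modified more than 120 minutes ago
find /tmp/djdemo/master/running -mindepth 1 -maxdepth 1 -type d -mmin +120
```

Against a real run, point the find command at your own master/running directory.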

 

If you detect a hung subtask, you will need to kill your job and restart (see below).  You can either wait until the rest of the subtasks are complete or, if you feel that the hung subtasks are using resources you would rather have working for you, kill the job forthwith.

 

 

Need to kill

You may need to kill your job, either because you have hung subtasks, or because it turns out to be a virus bent on conquering the world.  There are two steps you need to take:

  1. Kill the distribjob controller:

       a. to kill it as soon as possible (without corrupting its results):

          % distribjob controllerPropFileFullPath -kill

       b. to kill it without interrupting running subtasks:

          % distribjob controllerPropFileFullPath -killslow

  2. If you are running in a queue, kill the job in the queue.  (On UPenn’s Liniac, use the canceljob command.)

Restarting

To restart a job, change the restart property in the controller.prop file to yes, and start the job the same way you did the first time.  Subtasks that are already complete will be skipped.  (This is controlled by the file master/completedSubtasks.log, which is a list of completed subtasks.  If you need to redo a subtask that has already completed (at your own risk), you can delete its number from the list.  But remember that its results might already be merged into the main result.)
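Pruning one subtask from the completed list can be sketched like this, on a scratch copy.  That the file contains one bare subtask number per line is an assumption here; inspect your real master/completedSubtasks.log before editing it:

```shell
# Scratch demo: drop subtask 3 from the completed list so it is redone on restart
printf '1\n2\n3\n4\n' > /tmp/djdemo_completedSubtasks.log
# -v inverts the match, -x requires the whole line to equal '3'
grep -vx '3' /tmp/djdemo_completedSubtasks.log > /tmp/djdemo_tmp && mv /tmp/djdemo_tmp /tmp/djdemo_completedSubtasks.log
cat /tmp/djdemo_completedSubtasks.log
```

Note the write-to-temp-then-move pattern: grep cannot safely write to the same file it is reading.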