DistribJob is a distributed job controller: it distributes elements of an input set to nodes in a compute cluster for processing, and merges the result.
More specifically, the controller runs on a server. It breaks a large input set into smaller sets called subtasks, and assigns the subtasks to compute slots on nodes (machines) as they become available. On the node, it runs a command on the subtask’s input, and merges the result into the main result on the server. It robustly tracks failures and can be restarted.
DistribJob does not do scheduling or load balancing. For a particular job, it is given a static set of nodes, each with a static number of slots. When a slot is vacated, DistribJob fills it with the next subtask.
DistribJob is written in Perl, using Perl objects and modules.
Caveat: DistribJob is academic software. It has been extensively used and tested, and is robust. However, its installation, documentation, and usage have not been polished to industry standards. For example, some of the steps below forgo thorough handholding, expecting you instead to open a Perl file here or a test file there. None of the steps are hard, just rough-cut.
The DistribJob software is available for download from the GUS SVN repository. You will also need the CBIL libraries, which are available from the same repository.
Use the GUS Installer to install the DistribJob and CBIL software.
DistribJob comes in two modules: the DistribJob module itself, which is the core, and the optional DistribJobTasks module, which includes a set of predefined bioinformatics tasks (“tasks” are defined below). The latter are useful as samples, even if the particular tasks included are not relevant to your application.
To install either one, acquire the DistribJob package (either by downloading it or checking it out of SVN). Then follow the instructions in the DistribJob/readme.txt or DistribJobTasks/readme.txt file. Installing DistribJobTasks will also install DistribJob (but not vice versa).
DistribJob and DistribJobTasks include a set of tests. You are advised to run these before proceeding.
Once you have completed the installation, the first test is:
% distribjob $HOME/test/DistribJob/input/controller_local.prop 1 2
This test should work right out of the box if you installed your test/ directory into $HOME. Otherwise, edit test/DistribJob/input/controller_local.prop. Change the values of properties that refer to files in the test/ directory to reflect the actual location of test/.
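For example, if your test/ directory ended up in /home/alice/work instead of $HOME, a path-valued property would change along these lines (the property name below is illustrative, not necessarily one from the file; edit whichever properties in your controller_local.prop mention the test/ path):

```
# illustrative property name -- edit every property whose value
# contains the test/ path
inputdir=/home/alice/work/test/DistribJob/input
```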
The test should produce output that looks like this:
[sfischer@pythia ~]$ distribjob test/DistribJob/input/controller_local.prop 1 2
Reading properties from test/DistribJob/input/controller_local.prop
Reading properties from /home/sfischer/test/DistribJob/input/task.prop
Finding input set size
Input set size is 30
Subtask set size is 3 (10 subtasks)
Initializing server...
Initializing node 1...
Initializing node 2...
.
dispatching subTask 1 to node 1.1
dispatching subTask 2 to node 2.1
.
subTask 2 failed
dispatching subTask 3 to node 2.1
......
subTask 1 succeeded
dispatching subTask 4 to node 1.1
subTask 3 succeeded
…
subTask 9 succeeded
Cleaning up nodes...
Failure: 2 subtasks failed
Please look in /home/sfischer/test/master/failures/*/result
After analyzing and correcting failures:
1. mv /home/sfischer/test/master/failures /home/sfischer/test/master/failures.save
2. set restart=yes in test/DistribJob/input/controller_local.prop
3. restart the job
The test purposely includes two subtasks that fail, giving you the chance to practice correcting failures and restarting. The source of the problem is that two of the lines in test/DistribJob/input/inputset (which is the input for this job) have “oops” on them. To correct the problem, fix those lines (change their content, but do not add or delete lines), then follow the restart instructions printed at the end of the output.
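Concretely, the correction-and-restart sequence looks like this. It is shown as a sketch against a scratch directory so nothing real is touched; substitute your actual master directory, and note that the distribjob rerun is left as a comment:

```shell
# Stand-in for $HOME/test/master, so the sketch can run anywhere.
MASTER=$(mktemp -d)
mkdir -p $MASTER/failures/subtask_2/result

# Step 1 (manual): edit test/DistribJob/input/inputset to fix the two
# "oops" lines -- change their content, but do not add or delete lines.

# Step 2: set aside the failure records from the failed run.
mv $MASTER/failures $MASTER/failures.save

# Step 3: set restart=yes in controller_local.prop, then rerun:
#   distribjob $HOME/test/DistribJob/input/controller_local.prop 1 2

ls $MASTER   # now contains only failures.save
```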
This test is exclusively for UPenn’s Liniac. To run it:
% cd $HOME/test
% liniacsubmit 2 2 $HOME/test/DistribJob/input/controller_liniac.prop
The immediate output is a message confirming entry to the queue.
Once the queue runs the job, its output (identical to that from the local test) will appear in $HOME/test/xxxxx, where xxxxx is a log file whose name begins with the queue’s job id.
The bioinformatics tasks included in the DistribJobTasks module also have tests. Run these in a similar fashion to the basic tests. You may need to edit the task.prop files to specify where your bioinformatics resources are.
You provide a task to DistribJob (see configuration below). The task determines:
Choose from the built-in tasks provided by the DistribJobTasks module, or code your own task.
To code your own task, you will need to write a subclass of DistribJob::Task. For samples, see:
You will also need to define the command that will run your subtasks on the nodes. The command must:
Following are constraints that apply to your task’s input:
DistribJob distributes the subtasks to nodes (ie, machines in a cluster). You will specify how many compute slots each node has when you configure DistribJob (discussed below). The DistribJob package includes a growing number of built-in node types, including:
The LocalNode can run on any multi-processor machine, or even on a single-processor machine where it is efficient to have more than one subtask running at a time.
If your cluster uses a process control system other than one of these, you can still use DistribJob, but you need to write some simple code. DistribJob::Node is the object which represents a node. Its main purpose is to specify how to communicate between the server and node. The details of particular types of nodes are specified by subclasses of DistribJob::Node, which is what you will need to write. To learn how, use DistribJob::BprocNode and DistribJob::LocalNode as samples.
You may also want to help your users by providing a cluster-specific startup script. This script submits a job to your cluster’s queue, and then calls distribjob (see Running below) when the job is ready to run. As a sample, see DistribJob/bin/liniacsubmit, which submits a job to UPenn’s Liniac cluster. Also see DistribJob/bin/liniacjob, the script that the queue runs; it in turn calls distribjob.
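A parallel pair of scripts for another cluster can start as small as the sketch below. It targets a PBS-style queue; the mysubmit and myjob names and the exact qsub options are illustrative assumptions, and the submission command is printed rather than executed so you can adapt it first:

```shell
# Hypothetical analogue of liniacsubmit for a PBS-style queue.
# Usage: mysubmit nodecount minutes controllerPropFileFullPath
mysubmit() {
    nodes=$1; minutes=$2; propfile=$3
    # PBS walltime is [[hours:]minutes:]seconds, so "$minutes:00"
    # reads as minutes:seconds. "myjob" stands in for a script like
    # liniacjob that calls distribjob once the job starts.
    echo "qsub -l nodes=$nodes,walltime=$minutes:00 -v PROPFILE=$propfile myjob"
}

mysubmit 2 120 /my_inputdir/controller.prop
```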
The first step is to decide where you want your input directory, and where you want your master directory. Typically, these will go in a directory dedicated to this run of your task. Just make sure there is enough storage space to accommodate your results.
As an example, let’s say you choose to run in $HOME/myrun, and that your task creates a BLAST matrix (one of the provided bioinformatics tasks). Create your input directory, and copy the input file to it (in this case, a set of DNA sequences).
% mkdir -p $HOME/myrun/input
% cp myseqs.fsa $HOME/myrun/input
Create two configuration files for your task (the standard place for them is your input directory):
You may need to make other resources available to your task. For example, the provided BlastMatrixTask and BlastSimilarityTask make use of a database of sequences to BLAST against. This file may be very large, so you will not want to copy it into your input directory. Instead, specify the resource’s location using the appropriate property in your task.prop file.
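For instance, a task.prop for one of the BLAST tasks might point at the shared database with a path-valued property along these lines (the property names here are hypothetical; consult the task’s own documentation for the real ones):

```
# hypothetical property names -- check your task's documentation
inputFilePath=/home/me/myrun/input/myseqs.fsa
dbFilePath=/shared/blastdb/nrdb.fsa
```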
The distribjob command starts the controller. Running it with no arguments prints its usage. Running it with -help prints its full help display.
You use different commands to run locally or to run on different clusters.
Regardless of how you run, the result of all the subtasks will be merged into master/mainresult.
If you use DistribJob::LocalNode as your node type, you are running locally (ie, on your local server). You do not need to submit your job to a cluster queue; you just run distribjob directly. To do so, use this command (if, for example, you want to distribute across 3 virtual nodes):
% distribjob your_controller.prop 1 2 3
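Local running is selected through the nodeClass property in your controller.prop file; a minimal fragment, with all other required properties omitted from this sketch:

```
nodeClass=DistribJob::LocalNode
```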
You are assumed to know how to use UPenn’s Liniac cluster. To run distribjob on the Liniac, set the nodeClass property in your controller.prop file to DistribJob::SgeNode. Rather than calling distribjob directly, submit your distributed job to the Liniac’s queue by calling liniacsubmit. Here is its usage:
% liniacsubmit nodecount minutes controllerPropFileFullPath
liniacsubmit produces as immediate output the standard queue submission report. If you want to run on 20 nodes and estimate your job will take 10 hours, use this:
% cd where_I_want_my_log/
% liniacsubmit 20 600 /my_inputdir/controller.prop
When the queue runs your job, it will place the job’s log in the directory where you ran liniacsubmit. Check this log to see your job’s progress.
To see the status of your job on the queue, run:
% showq
If you plan on running on a cluster type other than UPenn’s Liniac, you or your administrator will need to provide commands that parallel liniacsubmit and liniacjob.
These are the kinds of problems you may encounter:
These are errors in which the job fails immediately, and no work is dispatched to the nodes. The most common cause is an error in a configuration file. The log should explain the problem, although the explanation might be less than obvious.
Once you have corrected the problem, delete your master directory and start again.
When a subtask running on a node fails, the log reports the failure, and the subtask’s files are copied to a directory under master/failures/subtask_nnn/result. Look carefully at all the files there to determine the cause of the failure.
If the problem appears to be one that will happen reproducibly, then you need to correct the source of the problem. For example, you may need to correct your input file (but don’t delete or add elements… this will mess up the indexing used by the controller). Or, you may need to provide missing data or executable files.
If the problem seems like a random flux of the cosmos, then you can defer correcting the source of the problem.
After you have handled all the failures, and when your job is no longer running, delete the master/failures/ directory and restart the job (see below).
The directory master/running contains a subdirectory for each running subtask. A hung subtask leaves a subdirectory there whose subtask number should have long since come and gone.
If you detect a hung subtask, you will need to kill your job and restart (see below). You can either wait until the rest of the subtasks are complete or, if you feel the hung subtasks are tying up resources you would rather have working for you, kill the job forthwith.
You may need to kill your job, either because you have hung subtasks, or because it turns out to be a virus bent on conquering the world. There are two commands available:

% distribjob controllerPropFileFullPath -kill

% distribjob controllerPropFileFullPath -killslow
To restart a job, change the restart property in the controller.prop file to yes, and start the job the same way you did the first time. Subtasks that have already completed will be skipped. (This is controlled by the file master/completedSubtasks.log, which lists the completed subtasks. If you need to redo a subtask that has already completed, you can, at your own risk, delete its number from the list. But remember that its results might already be merged into the main result.)
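For example, to force subtask 7 to run again on restart, you could filter its entry out of the log. This is a sketch that assumes the file holds one subtask number per line, which may not match your version’s format; inspect your own completedSubtasks.log first. It is shown here against a scratch copy; run the grep/mv step in your real master directory:

```shell
# Scratch stand-in for the master directory.
MASTER=$(mktemp -d)
printf '%s\n' 5 6 7 8 > $MASTER/completedSubtasks.log

# Drop the line that is exactly "7" so the controller redoes
# subtask 7 on restart (assumes one subtask number per line).
grep -vx 7 $MASTER/completedSubtasks.log > $MASTER/tmp &&
  mv $MASTER/tmp $MASTER/completedSubtasks.log

cat $MASTER/completedSubtasks.log   # prints 5, 6, 8, one per line
```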