
ReFlow User’s Manual

About this document

Requirements and restrictions

Installation

Basic concepts

ReFlow Features

Graph Features

Compile time validation

Conditional execution of steps and subgraphs

Global steps

External dependencies

Subgraph references

Runtime Features

Restartability

Undo

Test mode

Offline steps

Email Alerts

Failure recovery

Load throttling

Changing the graph after a workflow has begun

Code evolution after a workflow has begun

Dependencies on the GUS system

Information flow in the graph

Running modes

Subgraphs: advantages and disadvantages

Creating a workflow

Writing step classes

Creating a workflow home directory

Setting up config files

Files needed before you can compile a graph

The workflow.prop file

The root.params file

The steps.prop file

The stepsGlobal.prop file

The stepsShared.prop file

The loadBalance.prop file

Files required before you can test a workflow

The initOfflineSteps file

The initStopAfterSteps file

Files required before you can run a workflow for real

The alerts file

Graph XML files

Constructing a graph file

Input parameters

Constants

Steps

Calling subgraphs

Subgraph references

IncludeIfXmlFileExists

Compiling a graph

Types of errors

Global graphs

Design pattern: use a nested data directory

ReFlow command line tools

Testing a workflow

Resetting a test workflow

Testing email alerts

Running a workflow

Load balancing

Taking steps offline

Handling step failure

Undo

Visualizing a workflow

Port forwarding

Windows

UNIX/Mac

GUS dependencies or assumptions

Questions

Notes

DatasetLoader Steps

Displaying Datasources on a web site

Testing the syntax of a resources XML file

About this document

This document describes the ReFlow workflow system.  ReFlow was developed by the VEuPathDB project as an in-house workflow system.  However, it may be suitable for use by others.  In addition to general ReFlow documentation, some particulars of VEuPathDB's use of ReFlow are documented here.  They serve as primary documentation for in-house users, and as hints for external users.

This document does not cover the Dataset Classes system.  That system is a layer on top of ReFlow that auto-generates workflow graphs, based on datasets added to the system.  This document is about ReFlow itself.  Even if you plan to use the Dataset Classes system, it is likely you will need to understand the details of ReFlow described here.

Requirements and restrictions

There are basic requirements for running ReFlow, not including compute and file system facilities that are proportional to the scale of the data you plan to process:

There are restrictions that might affect non-VEuPathDB users of ReFlow, and that could be refactored if there is demand:

Installation

To install ReFlow please see the ReFlow Installation Guide.

Basic concepts

ReFlow is a simple workflow system based on a dependency graph.  It runs on UNIX only, and has been tested on Linux.  Its primary user interface is textual (command line and log files).  While it does not have a GUI, its textual tools do a good job managing very large workflows.

ReFlow is specifically designed for use in populating and maintaining integrating databases (warehouses). Its key feature, and how it got its name, is that it is a reversible workflow.  Any step in the graph can be undone, and when it is, ReFlow runs the graph in reverse to that point, erasing from the database, or other persistent stores, any consequences of that step and its children.

In overview:

The ReFlow Sample graph (in the ReFlow Graph Viewer) is a simple example.

The OrthoMCL-DB graph (also shown in the ReFlow Graph Viewer) is a good real world example.  To see the root graph, in the left panel expand generated -> OrthoMCL and click on orthomclFull.  

The PlasmoDB graph is a big workflow.  To see the root graph, in the left panel expand generated -> PlasmoDB and click on project.  

ReFlow Features

ReFlow has a number of features that make it well suited to running large, long running workflows that populate a database.   Once a database is built the ReFlow graph can continue to maintain the database as datasets are added, removed or refreshed.

Graph Features

Compile time validation

At compile time ReFlow validates that all:

ReFlow is careful about graph correctness.  This is important when graphs get large.

Conditional execution of steps and subgraphs

Each graph is passed a set of parameter values which are accessed throughout the graph file as variables.  Some of these parameter values might be useful to control the shape of the graph.  (The parameter values are resolved at compilation time.)  The <step> and <subgraph> elements each have optional includeIf= and excludeIf= attributes.   These take as values expressions that evaluate to either true or false.  If includeIf evaluates to true (false) the step or subgraph will be included (excluded) in the graph, and the opposite holds for excludeIf.   Logical expressions are allowed (they must be valid JavaScript). The exampleGraph.xml sample file has an example of this.
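
For illustration, here is a sketch of a conditionally included step.  The parameter name (isDraftGenome) and the stepclass package are invented for this example; check exampleGraph.xml for the authoritative expression style.

<!-- hypothetical: include this step only for draft genomes -->
<step name="maskLowComplexity"
      stepclass="MyProject::Steps::MaskLowComplexity"
      includeIf="'$$isDraftGenome$$' == 'true'">
</step>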

The shape of the graph can be changed at runtime as well.  The <step> and <subgraph> elements each have an optional skipIfFile= attribute.  This attribute takes a file name as a value.  If the file exists the step or subgraph will be skipped.  It is assumed that a preceding step has the role of writing (or not writing) this file.  The exampleRootGraph.xml sample file has an example of this.
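
A sketch of this runtime variant, with an invented marker-file path; a preceding step would write (or not write) this file:

<!-- hypothetical: skipped at runtime if the marker file exists -->
<step name="runExpensiveAnalysis"
      stepclass="MyProject::Steps::RunExpensiveAnalysis"
      skipIfFile="my_workflow_home/data/skipAnalysis.marker">
</step>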

This feature should be used rarely and with caution as it disrupts the declarative nature of the graph, and can risk untraceable complexity (i.e. spaghetti).

A ReFlow graph does not support programming structures such as iteration.

Global steps

Databases maintained by ReFlow typically load some commonly used resources, for example controlled vocabularies like the Gene Ontology.  Steps scattered around the graph might depend on one or more of these.  The natural way to encode this in a graph is to put the common resources in a subgraph near the root.  All other subgraphs, and the steps they contain, will depend on them.  However, this natural design has an undesired property.  All steps below them in the graph depend on the common resources, not just those that actually need the resource.  Over time these common resources might become out of date.   If they are at the top of the graph when that happens, then most of the graph must be undone, which is very costly.

ReFlow works around this, in a carefully limited way, by allowing steps to be declared as global.  Loading a dataset as a global step lets other steps anywhere in the graph declare a direct dependency on it.  This is similar to using a global variable in a programming language.  When a global step is undone, only steps with a direct dependency on it are undone.
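
For example (all names here are invented), a step anywhere in the graph could declare a direct dependency on a globally loaded Gene Ontology step, using the <dependsGlobal> element described later in this document (the name= attribute layout is an assumption; cross-check with the sample graphs):

<step name="annotateGoTerms"
      stepclass="MyProject::Steps::AnnotateGoTerms">
  <dependsGlobal name="insertGeneOntology"/>
</step>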

These too should be used with caution.

External dependencies

In rare cases it helps to have a step depend directly on a step that is not in its subgraph and is not a global step.  You can think of it as a step reaching a long arm across the subgraph hierarchy to depend on a faraway step.  The parent is declared as available for an external dependency, and the child step is declared to have an external dependency on it.  This is a workaround, similar to global steps.  It should be used rarely and carefully.  To understand better why it might be needed as an escape hatch, see the section below, Subgraphs: advantages and disadvantages.
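
A sketch of the two halves of an external dependency (all names are invented; the attribute layouts are assumptions based on the <step> element descriptions later in this document):

<!-- in one graph file: the parent advertises an external name -->
<step name="loadTaxonomy"
      stepclass="MyProject::Steps::LoadTaxonomy"
      externalName="taxonomyLoaded"/>

<!-- in another graph file: the child reaches across the hierarchy -->
<step name="mapOrganismNames"
      stepclass="MyProject::Steps::MapOrganismNames">
  <dependsExternal name="taxonomyLoaded"/>
</step>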

Subgraph references

Typically a subgraph call statically specifies the XML graph file to call (passing it parameters).  This is analogous to calling a subroutine in a programming language.   Occasionally it is useful to dynamically vary the subgraph that is called, depending on the context in which the calling graph itself was called.  In this case the calling graph might be passed as a parameter the name of the subgraph file to call.  This is a subgraph reference. It is analogous to a subroutine reference in a programming language.  In like fashion, the parameter signature of any graph called must match the parameters passed.

Runtime Features

Restartability

A ReFlow workflow can be stopped and started.  Stopping the controller prevents any new steps from starting, but allows running steps to complete.  The controller can be restarted at any time; the workflow will resume where it left off.  (ReFlow will prevent you from starting multiple controllers for one workflow.)  In urgent situations steps can be killed (using UNIX kill).  When a graph is restarted its details are compared against a record (in a database) of what has been run already to confirm that they are compatible.

Undo

If a step in the workflow is discovered to be incorrect or if its data needs updating, that step can be undone.  The workflow first must be stopped, and all running steps either completed or reset to the ready state.  The workflow is then run in undo mode, to undo the step.  Undo removes all consequences of the step, including persistent results.  After undo is complete the workflow can be restarted going forward.

Test mode

A workflow can, and should, be run in test mode before being put into production.    Test mode descends the graph in the same way real mode does, but the steps do not do any work.  When the controller calls a step it passes a flag indicating which mode is being run.  In test mode external commands and database loaders are not run (but their command lines are logged for debugging).  For each step, it tests the following:

Offline steps

Sometimes parts of the graph are under development or don’t have complete data yet. These steps can be taken offline, so the workflow will not attempt to run them.  Taking a step offline prevents it and all of its children from running.  Offline steps can easily be put back online.  Steps can be taken offline at workflow startup by placing their name in a configuration file; they can also be taken offline while the workflow is running.

Email Alerts

ReFlow can send mail when a step completes, if that step is registered for an email alert.  This is useful if someone is waiting to use the data that has been loaded.  It is also useful if a step requires manual intervention, for example manual editing of metadata terms in the database.  (This should be kept to a minimum.)  In the latter case the trick is to take the children of the step offline at workflow startup.  After the appropriate person receives the email and performs the intervention, the children can be put back online.

Failure recovery

Because ReFlow executes steps in parallel, and because graphs can get large, at a given moment there may be many steps running, completing and starting, plus some (hopefully not many) failing.  The controller log gives a running account of these state changes.  A failed step halts itself and prevents its children from running, but the rest of the flow continues.  ReFlow supports failure recovery by providing command line tools to easily find failed steps, and detailed per-step logging for diagnosis.  A well written step will provide instructions in the log on how to clean up partial results, and hints at how to correct the problem.  Once the cleanup and corrections are done, the step can be set back to the ready state and the workflow will try to run it again.

Load throttling

Steps in the workflow use different types of system resources, such as the local file server, an application database or a compute cluster.  Because the workflow runs in parallel, and because there may be thousands of steps, an unthrottled workflow could easily overwhelm one or more of these resources.  (Some types of resources, such as licensed bioinformatics software, might allow only one use at a time.)  Steps in the workflow can be tagged with a load type, and the workflow can be configured to limit a given load type to a specified number of steps.  For example, steps could be tagged with “database” and that resource could be throttled to allow only 10 such steps at a time.  There are no constraints on tag names; they can be whatever you want, but any tag that is used must be configured with a throttle limit.

Changing the graph after a workflow has begun

ReFlow graphs can be large and run for a long time; they can also be used to manage a database indefinitely.  In this environment the graph may need to change if the datasets loaded into the database must evolve.  In ReFlow you may change any part of the graph that has not yet run. You may add steps, remove steps or change steps (and subgraphs).  You may not change parts of the graph that have run.  To do so, those steps must first be undone.

Code evolution after a workflow has begun

One of the strengths of ReFlow is its ability to manage a database over a long period of time.  There is a challenge associated with this strength: dealing with code evolution in step classes and in subgraphs.  Say the workflow has loaded many large RNASeq datasets.  Months later the developers encounter a new type of RNASeq dataset that requires slightly different processing.  They might want to upgrade the RNASeq subgraph to include a new step used only for the new type of dataset, or modify an existing step to know about this new type of data.  To do so they must stop the workflow, upgrade the code and restart.  Steps that have already completed must not be harmed by the code changes, while steps that have yet to run must be able to take advantage of them.

Two features of ReFlow support this.  The first is that subgraphs may declare new parameters.  Completed subgraphs leave a record in the workflow database indicating what parameters they were run with.  If those values change on restart ReFlow throws an error.  However this error checking is relaxed in the case of new parameters.   No error is thrown if a subgraph was previously run with fewer parameters than the subgraph now requires.   The second is that step classes may declare optional parameters.  This has the advantage that newly required parameters (such as a flag to handle a new type of RNASeq data) can be added.  It has the disadvantage of imperfect error checking.  On balance it is needed to allow for code evolution.

Dependencies on the GUS system

In its current implementation ReFlow utilizes parts of the GUS system.  These are not deep dependencies.  If there were demand, these could be factored out or worked around.  The points of dependency are:

Information flow in the graph

Information is shared across steps in several ways:

In addition, the step class superclass provides a handle on these two property files (see Files needed before you can compile a graph):

Running modes

ReFlow supports two running modes:

Either one of those modes can run forward or reverse (undo).

Subgraphs: advantages and disadvantages

Subgraphs help factor a large graph into reusable pieces.  Like all reusable pieces in software they help us comprehend a complex structure.  A root graph with good subgraph factoring is relatively easy to understand.  If all the steps were in-line we would not see the forest for the trees.

But in ReFlow there is a cost to using subgraphs, having to do with undo.  If any step inside a subgraph needs to be undone then all steps that depend on the subgraph must also be undone.  Consider the simple graph below.

The bottom step depends on the subgraph represented by the box.  Imagine this bottom step doesn’t have a real dependency on the left step inside the box.  If this were written in-line it would only depend on the right step.  But because that left step is inside the subgraph the bottom step now depends on it.   If the left step, or any step inside the box, is undone, the bottom step must be too.  This is usually ok, unless the bottom step is expensive, such that you only want to undo it if really necessary.

In sum there is occasionally a trade-off to using subgraphs.  They allow reuse and abstraction but can sometimes incur an unnecessary undo cost.  In real life graphs you sometimes need to pull steps, such as the left step above, out of a subgraph they might naturally belong in if factoring were the only consideration.

Creating a workflow

Creating a ReFlow workflow is an iterative process.  The pieces of the puzzle that must come together are:

  1. Writing step classes in Perl.  These are reusable modules that do work.  Any type of task you want done is wrapped in a step class.  For example, you might have a step class that runs BLAST.
  2. Writing the graph in XML.  The graph has steps that depend on each other and can also call subgraphs.  Each step is a parameterized call to a step class.  Constructing the graph typically includes:
     a. writing low-level graphs that are reusable modules
     b. writing higher-level graphs that may call the lower-level ones
     c. writing a root graph that is the top level of the graph nesting
  3. Setting up a home directory for a run of the workflow.  This includes writing a number of configuration files.
  4. Compiling the graph.
  5. Running the graph in test mode.
  6. Finally, after you have iterated on the above process and completed development of the graph, running it for real.

Writing step classes

Step classes are written in Perl and are subclasses of WorkflowStepHandle.  Every step in a graph calls one step class.  

Here is a sample HelloWorld.pm step class that prints a message passed to it.  The FindOrthomclPairs.pm step class is a simple example of a real step class.
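
In case the sample files are not at hand, here is a rough sketch of the shape of a step class.  The run() signature and the getParamValue() and log() helpers are assumptions based on the descriptions of test mode and undo in this document; consult HelloWorld.pm for the authoritative interface.

package MyProject::Steps::HelloWorld;

use strict;
use ReFlow::Controller::WorkflowStepHandle;
our @ISA = ('ReFlow::Controller::WorkflowStepHandle');

# The controller invokes the step, indicating whether it is running in
# test mode and whether this is an undo (signature assumed).
sub run {
  my ($self, $test, $undo) = @_;

  # value supplied by a <paramValue> in the graph (accessor assumed)
  my $msg = $self->getParamValue('msg');

  if ($undo) {
    # erase any persistent consequences of the forward run
    # (HelloWorld has none)
  } else {
    $self->log("hello world: $msg");
  }
}

1;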

It may be that you already have an existing library of step classes, in which case you only need to write step classes for new tasks that are not in the library.

Step classes do the following:

Creating a workflow home directory

The workflow runs out of a home directory.  Make a directory that will be your workflow’s home.  One pattern that might work for you is:

workflows/
        my_site/
                my_sites_version/

where my_site is the name of the flow and my_sites_version is its version.

In this document we refer to your workflow home directory as my_workflow_home.

Setting up config files

The config files live in my_workflow_home/config.  To get started, make that directory.

This directory contains much valuable information.  For production workflows you should consider keeping it under version control (SVN, Git, etc).  If you do, be sure that the repository is secure, as these files may contain sensitive information such as login/password info and file paths.

Files needed before you can compile a graph

The workflow.prop file

This file provides the workflow controller its most basic configuration information.  It uses standard property file syntax.

The root.params file

This file contains values that are passed in as parameter values to the root graph XML file.  It uses standard property file syntax.
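
A minimal sketch (the parameter names here are invented); each name must match a <param> declared in your root graph XML file:

projectName=PlasmoDB
isDraftGenome=true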

The steps.prop file

Use properties in this file to provide values to steps that are contingent on the environment in which the workflow is running.  Values in this file are available to step classes when a particular step or type of step runs.  An example would be the location in the file system of a particular executable program run by a step.  You may have a second file with the same purpose, named after the workflow: workflow_name.prop.
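
A sketch with invented property names, showing the kind of environment-specific values a step class might read:

# locations of executables used by step classes (names are hypothetical)
blastPath=/usr/local/blast/bin
clusterBaseDir=/scratch/my_workflow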

The stepsGlobal.prop file

Use properties in this file to provide values needed by global steps and that are contingent on the environment in which the workflow is running.

The stepsShared.prop file

Use properties in this file to provide values that are generally needed by step classes and that are contingent on the environment in which the workflow is running. Values in this file are available to three types of files:

The dataset class xml files may also use the following macros, whose values are supplied by the workflow when it runs:

The loadBalance.prop file

This file controls load balancing during the run of the workflow.  In the workflow graph xml files, different steps may be given different stepLoadTypes= values.  These are arbitrary (single word) tags that designate a step as exerting one or more types of load on the hardware and software resources used by the workflow.  For example, stepLoadTypes= might be set to “computeCluster” or “database” to indicate that the step uses those resources.  The value may be a comma delimited list such as “sequenceTable,featureTable”.

In loadBalance.prop put the names of those tags, and give each a numeric value to indicate how many of those steps can run at one time.  This way you balance the load.   You are required to have at least one line in the file indicating the total number of steps that can run at one time, like this:

total=10

If you have more tags, give each its own line:

total=14
database=8
computeCluster=9
sequenceTable=4
featureTable=5

Files required before you can test a workflow

For both the files below, you can use % as a wildcard in step names.

The initOfflineSteps file

A file containing a list of step names (one per line).  These steps will be set to OFFLINE the next time the workflow starts up.  This is identical to calling the workflowstep command with the -f option.  The reason to use this init file is to prevent the workflow from running the steps before you have a chance to run workflowstep.  To put these steps back ONLINE, use the workflowstep command.  You can use the -f option and point it to this file.  Also remove the steps from this file.  NOTE: this file must exist, even if it is empty.  Do not remove the whole file.

The initStopAfterSteps file

A file containing a list of step names (one per line).  The workflow will stop after each of these steps completes, e.g. so that you can inspect database state.

Note: if you want to stop after a subgraph is complete be sure to put the stop after on the return.  For example:  global.NRDB_RSRC.return

Files required before you can run a workflow for real

The alerts file

This file controls email alerts sent when steps are done.  (They are not sent when a step fails, or in test or undo mode.)  The format is two columns, tab delimited, and there can be one or more rows.  The first column is a perl regex (do not include the /'s) that will find the full name of a step or steps to send an alert for; the second column is a comma delimited list of email addresses.  Here is an example that sends an alert on all Pvivax steps that end in makeDataDir:

pvivax.+makeDataDir$        joe@blotto.com, sue@flamers.org

See the section Testing Email Alerts to learn how to test this file before running with it.

Graph XML files

Constructing a graph file

Graph files are written in XML.  They contain steps and calls to subgraphs.  Once you understand how to construct a single file you can construct many, with some calling others as subgraphs.  (Cyclic calls are not allowed.)

Input parameters

Most graphs begin with a declaration of input parameters.  These are analogous to the arguments of a function.  Use the <param> element to declare each input parameter.  Doing so forces any call to this graph (as a root graph or subgraph) to provide values for all the declared input parameters.  Within the graph file, the values are visible and are referred to like this:  $$param_name$$.  These can be used anywhere within any XML attribute or element value.  See exampleGraph.xml for an example.
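
A sketch of a declaration (the parameter name is invented; see exampleGraph.xml for the authoritative form):

<param name="projectName"/>

Elsewhere in the file the value is then referenced as $$projectName$$.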

Constants

After the section that declares parameters you can optionally use the <constant> element to declare constants.  Use these for convenience or good “coding practice.”  See exampleRootGraph.xml for an example.
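
A sketch with an invented constant name; this assumes constants are referenced with the same $$name$$ syntax as parameters, which you should verify against exampleRootGraph.xml:

<constant name="genomeDataDir">$$projectName$$/genome</constant>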

Steps

Add a step to the graph with a <step> element.  The element calls a step class, passing it a set of parameter values, to execute a unit of work.  

The XML attributes of a <step> element are:

name (required)

The name of the step.  Must be unique within this graph file, but may be non-unique with respect to other graph files.  (A step’s full name is the path formed by the names of the subgraph calls leading to the step, plus the step’s local name as the basename.)  The step name may be used by other steps to declare a dependency on this step.  Also used for logging, reporting and managing this step.

stepclass (required)

The name of the step class to call.  Must be a fully qualified Perl package name of a subclass of ReFlow::Controller::WorkflowStepHandle.

externalName (optional)

An additional name for this step, if this step will be referred to by the <dependsExternal> element (see below).  This name must be unique across the entire graph, not just this graph file.  See External dependencies and see exampleGraph.xml for an example.

groupName (optional)

Assign steps to a group in this graph file.  Used for visual grouping in the ReFlow Graph Viewer.

includeIf or excludeIf (optional)

Include or exclude this step in the graph, based on how the value of this attribute evaluates.  The value may be any valid JavaScript expression that evaluates to true or false.  Like any value in the graph XML file, this value can contain a graph parameter value.  If this step is excluded from the graph its dependencies are instead added as dependencies to all of this step’s parents.  See exampleGraph.xml for an example.

skipIfFile (optional)

The full path of a file.  If the file exists, skip this step.  This is ReFlow’s only dynamic conditional.  (The file can be written by a preceding step anywhere in the graph to force this step to be skipped.)  The logic is similar to includeIf and excludeIf.  See exampleRootGraph.xml for an example.

stepLoadTypes (optional)

Use this attribute to assign arbitrary tags to this step.  The tags identify the step as exerting one or more types of load.  These tags must be mentioned in the loadBalance.prop configuration file.  In that file, each load type is given a throttle level.  Only that many steps of this load type are allowed to run at a time.  See testComputeCluster.xml for an example.

undoRoot (optional)

Use this attribute to restrict how this step is undone.  The undoRoot value must be the name of a step in this graph that is a parent of this step.  This step can only be undone as part of an undo that contains that step.  Some steps cannot actually be undone cleanly, particularly if they perform updates.

Elements contained within a Step (all are zero-or-more):

paramValue

Use this element to pass a parameter value to the step class.  

depends

Use this element to declare that this step depends on another step above it in the graph XML file.  Use that step’s local name as a value.

dependsExternal

Use this element to declare a dependency on a step that is external to this graph file.  That step must use the externalName attribute.  Use that name as the value for this element.  See External Dependencies and see anotherExampleGraph.xml for an example.  

dependsGlobal

Use this element to declare a dependency on a global step.  Use that step’s name as a value.  See Global steps.
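
Putting the pieces together, here is a sketch of a complete step (all names are invented; the name= attribute layout on <depends> is an assumption, so cross-check with exampleGraph.xml):

<step name="formatBlastDb"
      stepclass="MyProject::Steps::FormatBlastDb"
      stepLoadTypes="database">
  <paramValue name="inputFile">genome/proteins.fasta</paramValue>
  <depends name="extractProteins"/>
</step>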

Calling subgraphs

A subgraph call is similar to a step except that you use a <subgraph> element instead of a <step> element.  The <subgraph> element has an xmlFilename= attribute.  This is the graph that will be called. Like all graph XML files, it will have declared input parameters.  The <subgraph> element must use <paramValue> elements to provide values for every one of those parameters.  Here is an example graph XML file that calls a subgraph.
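
A sketch of a subgraph call (the file and parameter names are invented); every <param> declared in genome.xml must receive a <paramValue> here:

<subgraph name="genome"
          xmlFilename="myProject/genome.xml">
  <paramValue name="genomeVersion">$$genomeVersion$$</paramValue>
  <depends name="makeDataDir"/>
</subgraph>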

Subgraph references

In a <subgraph> element, if the value of the xmlFilename= attribute is a variable, then this <subgraph> element is a "subgraph reference."  This is analogous to a method reference in a programming language.  The pattern of its use is to embed in a standard graph a call to a graph that varies depending upon the context in which the standard graph is called.  (It is often used to call graphs that are generated by the graph templating system.)  It is particularly powerful if elements of the standard graph depend on the subgraph reference.  In this case, using a subgraph reference is critical.   If the subgraph reference might point to a non-existent graph, use the excludeIfXmlFileDoesNotExist=true attribute.  Here is an example graph XML file that uses a subgraph reference.
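
A sketch of a subgraph reference (the parameter name is invented); the file that is called is determined by the value the enclosing graph was passed:

<subgraph name="datasetSteps"
          xmlFilename="$$datasetGraphFile$$"
          excludeIfXmlFileDoesNotExist="true">
  <depends name="makeDataDir"/>
</subgraph>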

IncludeIfXmlFileExists

Coming soon...

Compiling a graph

Compiling a graph checks that:

To compile your graph:

$ workflow -h /files/cbil/data/cbil/TrypDB/wftest -c

An alternative way to do this, which also provides a detailed list of all the steps in the graph, is:

$ workflowXml -h /files/cbil/data/cbil/TrypDB/wftest

Types of errors

In the following example, a variable has a problem.  Go to the step mentioned and see what is wrong with the variable.

$ workflow -h /files/cbil/data/cbil/TrypDB/wftest -c 

Parameter 'genomeExtDbRlsSpecList' in step 'common.InsertGenegenomicsequenceWithSql' includes an unresolvable variable reference: '$$genomeExtDbRlsSpecList$$' 

In this example, the step is calling a subgraph that expects the genomeExtDbRlsSpecList parameter, but the caller is not providing a value for it:

$ workflow -h /files/cbil/data/cbil/TrypDB/wftest -c
Graph "compilation" failed.  The following subgraph parameter values are missing:

  File giardiadb/giardiaWorkflow.xml:
      step: common
          > genomeExtDbRlsSpecList 

Global graphs

Use a global graph to make steps available, as dependencies, to steps anywhere in the workflow.

A typical application would be a step that many steps throughout the graph depend on (and which may need to be undone regularly during iterative builds) but which cannot be local to those steps' subgraphs, because those subgraphs are reused.  (If the global step were in such a subgraph it would be executed multiple times, once each time the subgraph is reused.)

Any step in the workflow may use <dependsGlobal> to declare a direct dependency on any step in a global graph, with these rules:


In the root graph, a subgraph may be declared to be global by using <globalSubgraph> instead of <subgraph>, with these rules:

Design pattern: use a nested data directory

Files produced and consumed by steps are stored in a common directory structure.  A preferred design is for its structure to mirror the nesting of subgraphs.  This is accomplished by:

Steps within the same graph can use their graph’s $dataDir variable to refer to each other's files.  A best practice is to use a constant for these file names, to avoid file name mistakes.

The root of this tree is my_workflow_home/data.  The WorkflowStepInvoker superclass makes this available to all steps via the getDataDir() method; the directory it returns is relative to that root.  File and directory names passed to step classes from the graph are relative to that directory.
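
For example (names invented), a root graph that calls a genome subgraph, which in turn calls an annotation subgraph, might produce a data tree like this:

my_workflow_home/data/
        genome/
                proteins.fasta
                annotation/
                        features.gff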

ReFlow command line tools

Used while running a workflow:

workflow

Start the controller, ie, start running a workflow

workflowStopController

Stop the controller.  Steps will continue running, but no new ones will start

workflowstep

Change the state of a step, eg, from FAILED to READY

workflowForceStepState

Force a step's state to DONE or FAILED.  Use with caution.

workflowSummary

Get a brief summary of the state of a workflow

workflowUndoMgr

Run a series of undos.

workflowStepTimes

Get a report of the running times of completed steps, sorted by time.

workflowRestoreBackup

Restore a workflow graph from a workflow backup.

workflowIllegalGraphReport

Get a report of illegal graph changes.  This is useful when you update the graph of an existing workflow.

Used while developing a workflow:

workflowXml

Parse a ReFlow graph from XML files and print in serialized format

workflowAlertsTest

Test the validity of the email alerts config file.

workflowlog2gbrowse

Display a section of controller.log as intervals in gbrowse, to see timing information graphically

Used by the DatasetClasses system to generate workflow:

workflowDataset2DatasetLoaders

Create a datasetLoader xml file for a set of datasets

workflowDataset2Graphs

Create graph files for a set of datasets, from graph template files (in $GUS_HOME)

workflowDataset2PropsFile

Create a simple properties file from a dataset file for consumption by the DatasetPresenter system.

Used to create and display a graph as html:

workflowHtmlGenerator

Generate html to view a graph.

workflowGraphServer

Start an html server to show a ReFlow graph; an option to use instead of python.

workflowMakeDotFile

Obsolete, replaced by workflowHtmlGenerator

Used internally by the controller:

workflowRunStep

Run a step.

workflowMakeBackups

Make a backup of critical workflow running dirs

Testing a workflow

Once your workflow compiles, run it in test mode.  The purpose of test mode is to exercise the step classes.  It also gets you familiar with running a workflow and using the tools available.  Running a workflow in test mode is very similar to running in real mode.

For each step, it tests the following:

Before you run in test mode:

To run in test mode, use this command:

$ workflow -h /files/cbil/data/cbil/TrypDB/wftest -t

Notes on test mode

  1. When running a workflow you may find it convenient to have four terminal windows or tabs open, each one named as follows (or similarly):
     a. controller: a window which runs the workflow command (in the foreground).  It is best to run this command in screen to avoid unexpected termination.  (Termination won’t harm the workflow but will slow you down.)
     b. log: a window which runs tail -f my_workflow_home/logs/controller.log so you can watch the progress of the workflow
     c. steps: a window in which you have cd’d to my_workflow_home and in which you can easily run workflow -h my_workflow_home -s FAILED to locate failed steps.  Once you have fixed any steps, use the workflowstep command to set them to ready
     d. files: a window which is cd’d to the directories that contain your graph and step class files, and in which you fix failed steps
  2. Messages in the controller log about Exclude indicate that steps are excluded from the graph.  Either that step has excludeIf=true or includeIf=false.
  3. Ignore this error:
     could not find ParserDetails.ini in /usr/lib/perl5/site_perl/5.8.5/XML/SAX
  4. To handle initial failures:
     a. they are likely to be systemic, ie, errors in stepsShared.prop
     b. if you fix property files, you need to change the individual FAILED step to READY or reset the whole test flow (see Resetting a test workflow)
     c. if you change the graph.xml file (eg, correct a step class name), you need to bld and then restart the controller
  5. What happens if I kill the engine while my test is running?
     a. steps that are running continue to run safely
     b. they successfully update the workflow engine database
     c. just restart your workflow!
  6. Instead of grepping the controller.log for FAILED steps, use this command (the log may contain old info):
     workflow -h my_workflow_home -s FAILED
  7. Once you fix a step, you can change its status from FAILED to READY with this command (use the full step name path):
     workflowstep -h my_workflow_home -p step_name ready
  8. If you made a fix that will correct a set of steps, you can set them all to ready by using a pattern for the step name.  Use % as a wildcard.  The following example finds any step with Nrdb anywhere in its full step name path:
     workflowstep -h my_workflow_home -p %Nrdb% ready
     It is ok if your pattern finds steps that are not FAILED.  The output of the workflowstep command will show you warnings like this, which you are free to ignore:
     Warning: Can't change PbergheiPostLoadGenome.genomeAnalysis.blastxGenomicSeqsNrdb from 'DONE' to 'READY'
  9. To find out if the workflow is still processing steps, or if it is blocked by FAILED steps, run this command to find ON_DECK steps.  (If there are none, then the workflow is stalled, and you must fix FAILED steps to give it steps to process.)
     workflow -h my_workflow_home -s ON_DECK
  10. If fixing a FAILED step involves correcting the step's graph XML, then you will need to restart the controller so it can pick up the new XML files (don't forget to do a bld first).  Just kill the controller and then restart it.
  11. If fixing a FAILED step involves correcting a step that the FAILED step depends on, then you will need to UNDO that depended-on step.
  12. how to run the controller in the background
  13. checking steps
  14. monitoring controller log

Resetting a test workflow

You can blow away your test workflow by resetting it.  Use this if there is an old workflow of this name in the home dir:

workflow -h workflow_home_dir -reset

Notes:
  1. you can reset a test workflow (but not a real one)
  2. CAUTION: this will WIPE OUT your test workflow's home dir (except config/) and the workflow tables
  3. you might need to manually clean up cluster dirs
  4. you might need to manually clean up download dirs

Testing email alerts

If you provide the file my_workflow_home/config/alerts (see the Configuration section above), the workflow will send email alerts when the specified steps are done.

Errors in this file will prevent the alerts from being sent.  If you want to add alerts to the file, test it first:

  1. make a temporary alerts file with the alert or alerts you want to test
  2. use the workflowAlertsTest command to test that file.  It will show you which steps will generate an alert
  3. append the tested alerts to the real alerts file in the config/ dir

Running a workflow

Running a workflow in real mode is similar to running in test mode.  The main difference is that the workflow does real work, rather than pretend to do it.  The operative differences are...

Load balancing

Coming soon…

Taking steps offline

To prevent a step from running, set it to OFFLINE.  While the workflow is running, use the workflowstep command:

 workflowstep -h workflow_home_dir -p stepname_pattern [offline|online]

Note this command has no effect when the workflow is not running.  In that case (before the workflow is run), add the step name to config/initOfflineSteps as specified above.

Handling step failure

Step failure will be logged in logs/controller.log and in step/[step_name]/step.err.

  1. Clean up.  Instructions may be found in step.err, such as:

    Since this plugin step FAILED, please CLEAN UP THE DATABASE by calling:
     ga GUS::Community::Plugin::Undo --plugin ClinEpiData::Load::Plugin::InsertInvestigations --workflowContext --commit --algInvocationId PLUGIN_ALG_INV_ID_HERE

     Find the number for “AlgInvocationId” in the log; if there is none, then it is probably safe to skip this cleanup.  Some steps do not require any cleanup.
  2. Manually correct the problem if necessary: a programming issue, file permissions, missing data, network issues, etc.
  3. You may need to undo a previous step before proceeding.  To do so, stop the workflow using the workflowStopController command.  Note there may be some processes started by the workflow that continue to run if they are not affected by the failed step.
  4. Set the step state to READY using the workflowstep command.  The workflow (if running) will run the step again.

Kill a running step

If you discover a problem in a step while it is running, it is usually preferable to let it finish or fail; alternatively, you can kill it and any of its child processes.

Undo

Coming soon...

Visualizing a workflow

While authoring a workflow simply involves editing the workflow XML files in a workflow directory, it may be handy to see the workflow steps and subgraphs in a graphical, linkable format along the way.  The Authoring GUI provides this functionality.  

To run the authoring GUI and see your workflow in graph format:

If you change your workflow graphs in $GUS_HOME, run workflowHtmlGenerator again and reload your browser page.

Port forwarding

This example uses port 8087.  However, you should probably use a different port, as 8087 might be in use.  Choose a port that you think will not be used by others.  

If you want to use a local browser (eg, on your laptop), you need to use port forwarding.  To do so, choose a port that is not in use on your local machine and forward it to the port you chose on the server.

Windows

Here is an example in SecureCRT (Windows machines):

UNIX/Mac

If you are using OS X or Linux, you can set up the port forwarding as a part of the ssh command you use to log in to the server. For example, if you wanted to forward port 8087 on cassini to your laptop, run

ssh -L 8087:localhost:8087 my_login@cassini.pcbi.upenn.edu

GUS dependencies or assumptions

Questions

Notes

DatasetLoader Steps

Steps in the workflow called resources acquire data from outside the workflow and make it available to other steps in the flow, either as files or in the database.  Or they may simply load the data into the database for use after the workflow is complete.  The design pattern for a resource step is:

Displaying Datasources on a web site

When the workflow loads a resource it records minimal information describing it:  only its name and version.  The <resource> element contains extensive provenance and descriptive information.  This is stored in the ApiDB.DataSource table (and its friends).  The information is placed there by the Tuning Manager, which reads the resources XML file from the repository.  It only includes information for resources that are found in SRes.ExternalDatabase.name.  This allows the resources XML file to accrue new resources after the flow is complete without those being seen on the website.

Testing the syntax of a resources XML file

Use the validateResourceXml program to compare your xml against the RNG schema definition ApiCommonData/Load/lib/rng/resources.rng.

Here is its usage:

$ validateResourceXml

usage: validateResourceXml -f resources_xml_file