
ReFlow User’s Manual

About this document

Requirements and restrictions

Installation

Basic concepts

ReFlow Features

Graph Features

Compile time validation

Conditional execution of steps and subgraphs

Global steps

External dependencies

Subgraph references

Runtime Features

Restartability

Undo

Test mode

Offline steps

Email Alerts

Failure recovery

Load throttling

Changing the graph after a workflow has begun

Code evolution after a workflow has begun

Dependencies on the GUS system

Information flow in the graph

Running modes

Subgraphs: advantages and disadvantages

Creating a workflow

Writing step classes

Creating a workflow home directory

Setting up config files

Files needed before you can compile a graph

The workflow.prop file

The root.params file

The steps.prop file

The stepsGlobal.prop file

The stepsShared.prop file

The loadBalance.prop file

Files required before you can test a workflow

The initOfflineSteps file

The initStopAfterSteps file

Files required before you can run a workflow for real

The alerts file

Graph XML files

Constructing a graph file

Input parameters

Constants

Steps

Calling subgraphs

Subgraph references

IncludeIfXmlFileExists

Compiling a graph

Types of errors

Global graphs

Design pattern: use a nested data directory

ReFlow command line tools

Testing a workflow

Resetting a test workflow

Testing email alerts

Running a workflow

Load balancing

Taking steps offline

Handling step failure

Undo

Visualizing a workflow

Port forwarding

Windows

UNIX/Mac

GUS dependencies or assumptions

Questions

Notes

DatasetLoader Steps

Displaying Datasources on a web site

Testing the syntax of a resources XML file

About this document

This document describes the ReFlow workflow system.  ReFlow was developed by the VEuPathDB project as an in-house workflow system.  However, it may be suitable for use by others.  In addition to general ReFlow documentation, some particulars of VEuPathDB's use of ReFlow are documented here.  They serve as primary documentation for in-house users, and as hints for external users.

This document does not cover the Dataset Classes system.  That system is a layer on top of ReFlow that auto-generates workflow graphs, based on datasets added to the system.  This document is about ReFlow itself.  Even if you plan to use the Dataset Classes system, it is likely you will need to understand the details of ReFlow described here.

Requirements and restrictions

There are basic requirements for running ReFlow, not including compute and file system facilities that are proportional to the scale of the data you plan to process:

There are restrictions that might affect non-VEuPathDB users of ReFlow, and that could be refactored if there is demand:

Installation

To install ReFlow please see the ReFlow Installation Guide.

Basic concepts

ReFlow is a simple workflow system based on a dependency graph.  It runs on UNIX only, and has been tested on Linux.  Its primary user interface is textual (command line and log files).  While it does not have a GUI, its textual tools do a good job managing very large workflows.

ReFlow is specifically designed for use in populating and maintaining integrating databases (warehouses). Its key feature, and how it got its name, is that it is a reversible workflow.  Any step in the graph can be undone, and when it is, ReFlow runs the graph in reverse to that point, erasing from the database, or other persistent stores, any consequences of that step and its children.

In overview:

The ReFlow Sample graph (in the ReFlow Graph Viewer) is a simple example.

The OrthoMCL-DB graph (also shown in the ReFlow Graph Viewer) is a good real world example.  To see the root graph, in the left panel expand generated -> OrthoMCL and click on orthomclFull.  

The PlasmoDB graph is a big workflow.  To see the root graph, in the left panel expand generated -> PlasmoDB and click on project.  

ReFlow Features

ReFlow has a number of features that make it well suited to running large, long running workflows that populate a database.   Once a database is built the ReFlow graph can continue to maintain the database as datasets are added, removed or refreshed.

Graph Features

Compile time validation

At compile time ReFlow validates that all:

ReFlow is careful about graph correctness.  This is important when graphs get large.

Conditional execution of steps and subgraphs

Each graph is passed a set of parameter values which are accessed throughout the graph file as variables.  Some of these parameter values might be useful to control the shape of the graph.  (The parameter values are resolved at compilation time.)  The <step> and <subgraph> elements each have optional includeIf= and excludeIf= attributes.   These take as values expressions that evaluate to either true or false.  If includeIf evaluates to true (false) the step or subgraph will be included (excluded) in the graph, and the opposite holds for excludeIf.   Logical expressions are allowed (they must be valid JavaScript). The exampleGraph.xml sample file has an example of this.
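
For illustration, here is a sketch of a conditionally included step.  The parameter name (isDraftGenome) and the stepclass package are invented for this example; check exampleGraph.xml for the authoritative expression style.

<!-- hypothetical: include this step only for draft genomes -->
<step name="maskLowComplexity"
      stepclass="MyProject::Steps::MaskLowComplexity"
      includeIf="'$$isDraftGenome$$' == 'true'">
</step>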

The shape of the graph can be changed at runtime as well.  The <step> and <subgraph> elements each have an optional skipIfFile= attribute.  This attribute takes a file name as a value.  If the file exists the step or subgraph will be skipped.  It is assumed that a preceding step has the role of writing (or not writing) this file.  The exampleRootGraph.xml sample file has an example of this.
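
A sketch of this runtime variant, with an invented marker-file path; a preceding step would write (or not write) this file:

<!-- hypothetical: skipped at runtime if the marker file exists -->
<step name="runExpensiveAnalysis"
      stepclass="MyProject::Steps::RunExpensiveAnalysis"
      skipIfFile="my_workflow_home/data/skipAnalysis.marker">
</step>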

This feature should be used rarely and with caution as it disrupts the declarative nature of the graph, and can risk untraceable complexity (i.e. spaghetti).

A ReFlow graph does not support programming structures such as iteration.

Global steps

Databases maintained by ReFlow typically load some commonly used resources, for example controlled vocabularies like the Gene Ontology.  Steps scattered around the graph might depend on one or more of these.  The natural way to encode this in a graph is to put the common resources in a subgraph near the root.  All other subgraphs, and the steps they contain, will depend on them.  However, this natural design has an undesired property.  All steps below them in the graph depend on the common resources, not just those that actually need the resource.  Over time these common resources might become out of date.   If they are at the top of the graph when that happens, then most of the graph must be undone, which is very costly.

ReFlow works around this, in a carefully limited way, by allowing steps to be declared as global.  Loading a dataset as a global step lets other steps anywhere in the graph declare a direct dependency on it.  This is similar to using a global variable in a programming language.  When a global step is undone, only steps with a direct dependency on it are undone.
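
For example (all names here are invented), a step anywhere in the graph could declare a direct dependency on a globally loaded Gene Ontology step, using the <dependsGlobal> element described later in this document (the name= attribute layout is an assumption; cross-check with the sample graphs):

<step name="annotateGoTerms"
      stepclass="MyProject::Steps::AnnotateGoTerms">
  <dependsGlobal name="insertGeneOntology"/>
</step>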

These too should be used with caution.

External dependencies

In rare cases it helps to have a step depend directly on a step that is not in its subgraph and is not a global step.  You can think of it as a step reaching a long arm across the subgraph hierarchy to depend on a faraway step.  The parent is declared as available for an external dependency, and the child step is declared to have an external dependency on it.  This is a workaround, similar to global steps.  It should be used rarely and carefully.  To understand better why it might be needed as an escape hatch, see the section below, Subgraphs: advantages and disadvantages.
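
A sketch of the two halves of an external dependency (all names are invented; the attribute layouts are assumptions based on the <step> element descriptions later in this document):

<!-- in one graph file: the parent advertises an external name -->
<step name="loadTaxonomy"
      stepclass="MyProject::Steps::LoadTaxonomy"
      externalName="taxonomyLoaded"/>

<!-- in another graph file: the child reaches across the hierarchy -->
<step name="mapOrganismNames"
      stepclass="MyProject::Steps::MapOrganismNames">
  <dependsExternal name="taxonomyLoaded"/>
</step>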

Subgraph references

Typically a subgraph call statically specifies the XML graph file to call (passing it parameters).  This is analogous to calling a subroutine in a programming language.   Occasionally it is useful to dynamically vary the subgraph that is called, depending on the context in which the calling graph itself was called.  In this case the calling graph might be passed as a parameter the name of the subgraph file to call.  This is a subgraph reference. It is analogous to a subroutine reference in a programming language.  In like fashion, the parameter signature of any graph called must match the parameters passed.

Runtime Features

Restartability

A ReFlow workflow can be stopped and started.  Stopping the controller prevents any new steps from starting, but allows running steps to complete.  The controller can be restarted at any time; the workflow will resume where it left off.  (ReFlow will prevent you from starting multiple controllers for one workflow.)  In urgent situations steps can be killed (using UNIX kill).  When a graph is restarted its details are compared against a record (in a database) of what has been run already to confirm that they are compatible.

Undo

If a step in the workflow is discovered to be incorrect or if its data needs updating, that step can be undone.  The workflow first must be stopped, and all running steps either completed or reset to the ready state.  The workflow is then run in undo mode, to undo the step.  Undo removes all consequences of the step, including persistent results.  After undo is complete the workflow can be restarted going forward.

Test mode

A workflow can, and should, be run in test mode before being put into production.    Test mode descends the graph in the same way real mode does, but the steps do not do any work.  When the controller calls a step it passes a flag indicating which mode is being run.  In test mode external commands and database loaders are not run (but their command lines are logged for debugging).  For each step, it tests the following:

Offline steps

Sometimes parts of the graph are under development or don’t have complete data yet. These steps can be taken offline, so the workflow will not attempt to run them.  Taking a step offline prevents it and all of its children from running.  Offline steps can easily be put back online.  Steps can be taken offline at workflow startup by placing their name in a configuration file; they can also be taken offline while the workflow is running.

Email Alerts

ReFlow can send mail when a step completes, if that step is registered for an email alert.  This is useful if someone is waiting to use the data that has been loaded.  It is also useful if a step requires manual intervention, for example manual editing of metadata terms in the database.  (This should be kept to a minimum.)  In the latter case the trick is to take the children of the step offline at workflow startup.  After the appropriate person receives the email and performs the intervention, the children can be put back online.

Failure recovery

Because ReFlow executes steps in parallel, and because graphs can get large, at a given moment there may be many steps running, completing and starting, plus some (hopefully not many) failing.  The controller log gives a running account of these state changes.  A failed step halts itself and prevents its children from running, but the rest of the flow continues.  ReFlow supports failure recovery by providing command line tools to easily find failed steps, and detailed per-step logging for diagnosis.  A well written step will provide instructions in the log on how to clean up partial results, and hints at how to correct the problem.  Once the cleanup and corrections are done, the step can be set back to the ready state and the workflow will try to run it again.

Load throttling

Steps in the workflow use different types of system resources, such as the local file server, an application database or a compute cluster.  Because the workflow runs in parallel, and because there may be thousands of steps, an unthrottled workflow could easily overwhelm one or more of these resources.  (Some types of resources, such as licensed bioinformatics software, might allow only one use at a time.)  Steps in the workflow can be tagged with a load type, and the workflow can be configured to limit a given load type to a specified number of steps.  For example, steps could be tagged with “database” and that resource could be throttled to allow only 10 such steps at a time.  There are no constraints on tag names; they can be whatever you want, but any tag that is used must be configured with a throttle limit.

Changing the graph after a workflow has begun

ReFlow graphs can be large and run for a long time; they can also be used to manage a database indefinitely.  In this environment the graph may need to change if the datasets loaded into the database must evolve.  In ReFlow you may change any part of the graph that has not yet run. You may add steps, remove steps or change steps (and subgraphs).  You may not change parts of the graph that have run.  To do so, those steps must first be undone.

Code evolution after a workflow has begun

One of the strengths of ReFlow is its ability to manage a database over a long period of time.  There is a challenge associated with this strength: dealing with code evolution in step classes and in subgraphs.  Say the workflow has loaded many large RNASeq datasets.  Months later the developers encounter a new type of RNASeq dataset that requires slightly different processing.  They might want to upgrade the RNASeq subgraph to include a new step used only for the new type of dataset, or modify an existing step to know about this new type of data.  To do so they must stop the workflow, upgrade the code and restart.  Steps that have already completed must not be harmed by the code changes, while steps that have yet to run must be able to take advantage of them.

Two features of ReFlow support this.  The first is that subgraphs may declare new parameters.  Completed subgraphs leave a record in the workflow database indicating what parameters they were run with.  If those values change on restart ReFlow throws an error.  However this error checking is relaxed in the case of new parameters.   No error is thrown if a subgraph was previously run with fewer parameters than the subgraph now requires.   The second is that step classes may declare optional parameters.  This has the advantage that newly required parameters (such as a flag to handle a new type of RNASeq data) can be added.  It has the disadvantage of imperfect error checking.  On balance it is needed to allow for code evolution.

Dependencies on the GUS system

In its current implementation ReFlow utilizes parts of the GUS system.  These are not deep dependencies.  If there were demand, these could be factored out or worked around.  The points of dependency are:

Information flow in the graph

Information is shared across steps in several ways:

In addition, the step class superclass provides a handle on these two property files (see Files needed before you can compile a graph):

Running modes

ReFlow supports two running modes:

Either one of those modes can run forward or reverse (undo).

Subgraphs: advantages and disadvantages

Subgraphs help factor a large graph into reusable pieces.  Like all reusable pieces in software they help us comprehend a complex structure.  A root graph with good subgraph factoring is relatively easy to understand.  If all the steps were in-line we would not see the forest for the trees.

But in ReFlow there is a cost to using subgraphs, having to do with undo.  If any step inside a subgraph needs to be undone then all steps that depend on the subgraph must also be undone.  Consider the simple graph below.

The bottom step depends on the subgraph represented by the box.  Imagine this bottom step doesn’t have a real dependency on the left step inside the box.  If this were written in-line it would only depend on the right step.  But because that left step is inside the subgraph the bottom step now depends on it.   If the left step, or any step inside the box, is undone, the bottom step must be too.  This is usually ok, unless the bottom step is expensive, such that you only want to undo it if really necessary.

In sum there is occasionally a trade-off to using subgraphs.  They allow reuse and abstraction but can sometimes incur an unnecessary undo cost.  In real life graphs you sometimes need to pull steps, such as the left step above, out of a subgraph they might naturally belong in if factoring were the only consideration.

Creating a workflow

Creating a ReFlow workflow is an iterative process.  The pieces of the puzzle that must come together are:

  1. Writing step classes in Perl.  These are reusable modules that do work.  Any type of task you want done is wrapped in a step class.  For example, you might have a step class that runs BLAST.
  2. Writing the graph in XML.  The graph has steps that depend on each other and can also call subgraphs.  Each step is a parameterized call to a step class.  Constructing the graph typically includes:
     a. writing low-level graphs that are reusable modules
     b. writing higher-level graphs that may call the lower-level ones
     c. writing a root graph that is the top level of the graph nesting
  3. Setting up a home directory for a run of the workflow.  This includes writing a number of configuration files.
  4. Compiling the graph.
  5. Running the graph in test mode.
  6. Finally, after you have iterated on the above process and completed development of the graph, running it for real.

Writing step classes

Step classes are written in Perl and are subclasses of WorkflowStepHandle.  Every step in a graph calls one step class.  

Here is a sample HelloWorld.pm step class that prints a message passed to it.  The FindOrthomclPairs.pm step class is a simple example of a real step class.
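
In case the sample files are not at hand, here is a rough sketch of the shape of a step class.  The run() signature and the getParamValue() and log() helpers are assumptions based on the descriptions of test mode and undo in this document; consult HelloWorld.pm for the authoritative interface.

package MyProject::Steps::HelloWorld;

use strict;
use ReFlow::Controller::WorkflowStepHandle;
our @ISA = ('ReFlow::Controller::WorkflowStepHandle');

# The controller invokes the step, indicating whether it is running in
# test mode and whether this is an undo (signature assumed).
sub run {
  my ($self, $test, $undo) = @_;

  # value supplied by a <paramValue> in the graph (accessor assumed)
  my $msg = $self->getParamValue('msg');

  if ($undo) {
    # erase any persistent consequences of the forward run
    # (HelloWorld has none)
  } else {
    $self->log("hello world: $msg");
  }
}

1;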

It may be that you already have an existing library of step classes, in which case you only need to write step classes for new tasks that are not in the library.

Step classes do the following:

Creating a workflow home directory

The workflow runs out of a home directory.  Make a directory that will be your workflow’s home.  One pattern that might work for you is:

workflows/
        my_site/
                my_sites_version/

where my_site is the name of the flow and my_sites_version is its version.

In this document we refer to your workflow home directory as my_workflow_home.

Setting up config files

The config files live in my_workflow_home/config.  To get started, make that directory.

This directory contains much valuable information.  For production workflows you should consider keeping it under version control (SVN, Git, etc).  If you do, be sure that the repository is secure, as these files may contain sensitive information such as login/password info and file paths.

Files needed before you can compile a graph

The workflow.prop file

This file provides the workflow controller its most basic configuration information.  It uses standard property file syntax.

The root.params file

This file contains values that are passed in as parameter values to the root graph XML file.  It uses standard property file syntax.
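
A minimal sketch (the parameter names here are invented); each name must match a <param> declared in your root graph XML file:

projectName=PlasmoDB
isDraftGenome=true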

The steps.prop file

Use properties in this file to provide values to steps that are contingent on the environment in which the workflow is running.  Values in this file are available to step classes when a particular step or type of step runs.  An example would be the location in the file system of a particular executable program run by a step.  You may have a second file with the same purpose, named after the workflow: workflow_name.prop.
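
A sketch with invented property names, showing the kind of environment-specific values a step class might read:

# locations of executables used by step classes (names are hypothetical)
blastPath=/usr/local/blast/bin
clusterBaseDir=/scratch/my_workflow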

The stepsGlobal.prop file

Use properties in this file to provide values needed by global steps and that are contingent on the environment in which the workflow is running.

The stepsShared.prop file

Use properties in this file to provide values that are generally needed by step classes and that are contingent on the environment in which the workflow is running. Values in this file are available to three types of files:

The dataset class xml files may also use the following macros, whose values are supplied by the workflow when it runs:

The loadBalance.prop file

This file controls load balancing during the run of the workflow.  In the workflow graph xml files, different steps may be given different stepLoadTypes= values.  These are arbitrary (single word) tags that designate a step as exerting one or more types of load on the hardware and software resources used by the workflow.  For example, stepLoadTypes= might be set to “computeCluster” or “database” to indicate that the step uses those resources.  The value may be a comma delimited list such as “sequenceTable,featureTable”.

In loadBalance.prop put the names of those tags, and give each a numeric value to indicate how many of those steps can run at one time.  This way you balance the load.   You are required to have at least one line in the file indicating the total number of steps that can run at one time, like this:

total=10

If you have more tags, give each its own line:

total=14
database=8
computeCluster=9
sequenceTable=4
featureTable=5

Files required before you can test a workflow

For both the files below, you can use % as a wildcard in step names.

The initOfflineSteps file

A file containing a list of step names (one per line).  These steps will be set to OFFLINE the next time the workflow starts up.  This is identical to calling the workflowstep command with the -f option.  The reason to use this init file is to prevent the workflow from running the steps before you have a chance to run workflowstep.  To put these steps back ONLINE, use the workflowstep command.  You can use the -f option and point it to this file.  Also remove the steps from this file.  NOTE: this file must exist, even if it is empty.  Do not remove the whole file.

The initStopAfterSteps file

A file containing a list of step names (one per line).  The workflow will stop after each of these steps completes, e.g. so that you can inspect database state.

Note: if you want to stop after a subgraph is complete be sure to put the stop after on the return.  For example:  global.NRDB_RSRC.return

Files required before you can run a workflow for real

The alerts file

This file controls email alerts sent when steps are done.  (They are not sent when a step fails, or in test or undo mode.)  The format is two columns, tab delimited, and there can be one or more rows.  The first column is a perl regex (do not include the /'s) that will find the full name of a step or steps to send an alert for; the second column is a comma delimited list of email addresses.  Here is an example that sends an alert on all Pvivax steps that end in makeDataDir:

pvivax.+makeDataDir$        joe@blotto.com, sue@flamers.org

See the section Testing Email Alerts to learn how to test this file before running with it.

Graph XML files

Constructing a graph file

Graph files are written in XML.  They contain steps and calls to subgraphs.  Once you understand how to construct a single file you can construct many, with some calling others as subgraphs.  (Cyclic calls are not allowed.)

Input parameters

Most graphs begin with a declaration of input parameters.  These are analogous to the arguments of a function.  Use the <param> element to declare each input parameter.  Doing so forces any call to this graph (as a root graph or subgraph) to provide values for all the declared input parameters.  Within the graph file, the values are visible and are referred to like this:  $$param_name$$.  These can be used anywhere within any XML attribute or element value.  See exampleGraph.xml for an example.
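
A sketch of a declaration (the parameter name is invented; see exampleGraph.xml for the authoritative form):

<param name="projectName"/>

Elsewhere in the file the value is then referenced as $$projectName$$.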

Constants

After the section that declares parameters you can optionally use the <constant> element to declare constants.  Use these for convenience or good “coding practice.”  See exampleRootGraph.xml for an example.
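
A sketch with an invented constant name; this assumes constants are referenced with the same $$name$$ syntax as parameters, which you should verify against exampleRootGraph.xml:

<constant name="genomeDataDir">$$projectName$$/genome</constant>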

Steps

Add a step to the graph with a <step> element.  The element calls a step class, passing it a set of parameter values, to execute a unit of work.  

The XML attributes of a <step> element are:

name (required)

The name of the step.  Must be unique within this graph file, but may be non-unique with respect to other graph files.  (A step’s full name is the path formed by the names of the subgraph calls leading to the step, plus the step’s local name as the basename.)  The step name may be used by other steps to declare a dependency on this step.  Also used for logging, reporting and managing this step.

stepclass (required)

The name of the step class to call.  Must be a fully qualified Perl package name of a subclass of ReFlow::Controller::WorkflowStepHandle.

externalName (optional)

An additional name for this step, if this step will be referred to by the <dependsExternal> element (see below).  This name must be unique across the entire graph, not just this graph file.  See External dependencies and see exampleGraph.xml for an example.

groupName (optional)

Assign steps to a group in this graph file.  Used for visual grouping in the ReFlow Graph Viewer.

includeIf or excludeIf (optional)

Include or exclude this step in the graph, based on how the value of this attribute evaluates.  The value may be any valid JavaScript expression that evaluates to true or false.  Like any value in the graph XML file, this value can contain a graph parameter value.  If this step is excluded from the graph its dependencies are instead added as dependencies to all of this step’s parents.  See exampleGraph.xml for an example.

skipIfFile (optional)

The full path of a file.  If the file exists, skip this step.  This is ReFlow’s only dynamic conditional.  (The file can be written by a preceding step anywhere in the graph to force this step to be skipped.)  The logic is similar to includeIf and excludeIf.  See exampleRootGraph.xml for an example.

stepLoadTypes (optional)

Use this attribute to assign arbitrary tags to this step.  The tags identify the step as exerting one or more types of load.  These tags must be mentioned in the loadBalance.prop configuration file.  In that file, each load type is given a throttle level.  Only that many steps of this load type are allowed to run at a time.  See testComputeCluster.xml for an example.

undoRoot (optional)

Use this attribute to restrict how this step is undone.  The undoRoot value must be the name of a step in this graph that is a parent of this step.  This step can only be undone as part of an undo that contains that step.  Some steps cannot actually be undone cleanly, particularly if they perform updates.

Elements contained within a Step (all are zero-or-more):

paramValue

Use this element to pass a parameter value to the step class.  

depends

Use this element to declare that this step depends on another step above it in the graph XML file.  Use that step’s local name as a value.

dependsExternal

Use this element to declare a dependency on a step that is external to this graph file.  That step must use the externalName attribute.  Use that name as the value for this element.  See External Dependencies and see anotherExampleGraph.xml for an example.  

dependsGlobal

Use this element to declare a dependency on a global step.  Use that step’s name as a value.  See Global steps.
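
Putting the pieces together, here is a sketch of a complete step (all names are invented; the name= attribute layout on <depends> is an assumption, so cross-check with exampleGraph.xml):

<step name="formatBlastDb"
      stepclass="MyProject::Steps::FormatBlastDb"
      stepLoadTypes="database">
  <paramValue name="inputFile">genome/proteins.fasta</paramValue>
  <depends name="extractProteins"/>
</step>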

Calling subgraphs

A subgraph call is similar to a step except that you use a <subgraph> element instead of a <step> element.  The <subgraph> element has an xmlFilename= attribute.  This is the graph that will be called. Like all graph XML files, it will have declared input parameters.  The <subgraph> element must use <paramValue> elements to provide values for every one of those parameters.  Here is an example graph XML file that calls a subgraph.
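
A sketch of a subgraph call (the file and parameter names are invented); every <param> declared in genome.xml must receive a <paramValue> here:

<subgraph name="genome"
          xmlFilename="myProject/genome.xml">
  <paramValue name="genomeVersion">$$genomeVersion$$</paramValue>
  <depends name="makeDataDir"/>
</subgraph>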

Subgraph references

In a <subgraph> element, if the value of the xmlFilename= attribute is a variable, then this <subgraph> element is a "subgraph reference."  This is analogous to a method reference in a programming language.  The pattern of its use is to embed in a standard graph a call to a graph that varies depending upon the context in which the standard graph is called.  (It is often used to call graphs that are generated by the graph templating system.)  It is particularly powerful if elements of the standard graph depend on the subgraph reference.  In this case, using a subgraph reference is critical.   If the subgraph reference might point to a non-existent graph, use the excludeIfXmlFileDoesNotExist=true attribute.  Here is an example graph XML file that uses a subgraph reference.
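
A sketch of a subgraph reference (the parameter name is invented); the file that is called is determined by the value the enclosing graph was passed:

<subgraph name="datasetSteps"
          xmlFilename="$$datasetGraphFile$$"
          excludeIfXmlFileDoesNotExist="true">
  <depends name="makeDataDir"/>
</subgraph>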

IncludeIfXmlFileExists

Coming soon...

Compiling a graph

Compiling a graph checks that:

To compile your graph:

$ workflow -h /files/cbil/data/cbil/TrypDB/wftest -c

An alternative way to do this, which also provides a detailed list of all the steps in the graph, is:

$ workflowXml -h /files/cbil/data/cbil/TrypDB/wftest

Types of errors

In the following example, a variable has a problem.  Go to the step mentioned and see what is wrong with the variable.

$ workflow -h /files/cbil/data/cbil/TrypDB/wftest -c 

Parameter 'genomeExtDbRlsSpecList' in step 'common.InsertGenegenomicsequenceWithSql' includes an unresolvable variable reference: '$$genomeExtDbRlsSpecList$$' 

In this example, the step is calling a subgraph that expects the genomeExtDbRlsSpecList parameter, but the caller is not providing a value for it:

$ workflow -h /files/cbil/data/cbil/TrypDB/wftest -c
Graph "compilation" failed.  The following subgraph parameter values are missing:

  File giardiadb/giardiaWorkflow.xml:
      step: common
          > genomeExtDbRlsSpecList 

Global graphs

Use a global graph to make steps available, as dependencies, to steps anywhere in the workflow.

A typical application would be a step that many steps throughout the graph depend on (and which may need to be undone regularly during iterative builds) but which cannot be local to those steps' subgraphs, because those subgraphs are reused.  (If the global step were in such a subgraph it would be executed multiple times, once each time the subgraph is reused.)

Any step in the workflow may use <dependsGlobal> to declare a direct dependency on any step in a global graph, with these rules:


In the root graph, a subgraph may be declared to be global by using <globalSubgraph> instead of <subgraph>, with these rules:

Design pattern: use a nested data directory

Files produced and consumed by steps are stored in a common directory structure.  A preferred design is for its structure to mirror the nesting of subgraphs.  This is accomplished by:

Steps within the same graph can use their graph’s $dataDir variable to refer to each other's files.  A best practice is to use a constant for these file names, to avoid file name mistakes.

The root of this tree is my_workflow_home/data.  The WorkflowStepInvoker superclass makes this available to all steps via the getDataDir() method; the directory it returns is relative to that root.  File and directory names passed to step classes from the graph are relative to that directory.
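
For example (names invented), a root graph that calls a genome subgraph, which in turn calls an annotation subgraph, might produce a data tree like this:

my_workflow_home/data/
        genome/
                proteins.fasta
                annotation/
                        features.gff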

ReFlow command line tools

Used while running a workflow:

workflow

Start the controller, ie, start running a workflow

workflowStopController

Stop the controller.  Steps will continue running, but no new ones will start

workflowstep

Change the state of a step, eg, from FAILED to READY

workflowForceStepState

Force a step's state to DONE or FAILED.  Use with caution.

workflowSummary

Get a brief summary of the state of a workflow

workflowUndoMgr

Run a series of undos.

workflowStepTimes

Get a report of the running times of completed steps, sorted by time.

workflowRestoreBackup

Restore a workflow graph from a workflow backup.

workflowIllegalGraphReport

Get a report of illegal graph changes.  This is useful when you update the graph of an existing workflow.

Used while developing a workflow:

workflowXml

Parse a ReFlow graph from XML files and print in serialized format

workflowAlertsTest

Test the validity of the email alerts config file.

workflowlog2gbrowse

Display a section of controller.log as intervals in gbrowse, to see timing information graphically

Used by the DatasetClasses system to generate workflow:

workflowDataset2DatasetLoaders

Create a datasetLoader xml file for a set of datasets

workflowDataset2Graphs

Create graph files for a set of datasets, from graph template files (in $GUS_HOME)

workflowDataset2PropsFile

Create a simple properties file from a dataset file for consumption by the DatasetPresenter system.

Used to create and display a graph as html:

workflowHtmlGenerator

Generate html to view a graph.

workflowGraphServer

Start an html server to show a ReFlow graph; an option to use instead of python.

workflowMakeDotFile

Obsolete, replaced by workflowHtmlGenerator

Used internally by the controller:

workflowRunStep

Run a step.

workflowMakeBackups

Make a backup of critical workflow running dirs

Testing a workflow

Once your workflow compiles, run it in test mode.  The purpose of test mode is to exercise the step classes.  It also gets you familiar with running a workflow and using the tools available.  Running a workflow in test mode is very similar to running in real mode.

For each step, it tests the following:

Before you run in test mode:

To run in test mode, use this command:

$ workflow -h /files/cbil/data/cbil/TrypDB/wftest -t

Notes on test mode

  1. When running a workflow you may find it convenient to have four terminal windows or tabs open, each one named as follows (or similarly):
     a. controller: a window which runs the workflow command (in the foreground).  It is best to run this command in screen to avoid unexpected termination.  (Termination won’t harm the workflow but will slow you down.)
     b. log: a window which runs tail -f my_workflow_home/logs/controller.log so you can watch the progress of the workflow
     c. steps: a window in which you have cd’d to my_workflow_home and in which you can easily run workflow -h my_workflow_home -s FAILED to locate failed steps.  Once you have fixed any steps, use the workflowstep command to set them to ready
     d. files: a window which is cd’d to the directories that contain your graph and step class files, and in which you fix failed steps
  2. Messages in the controller log about Exclude indicate that steps are excluded from the graph.  Either that step has excludeIf=true or includeIf=false.
  3. Ignore this error:
     could not find ParserDetails.ini in /usr/lib/perl5/site_perl/5.8.5/XML/SAX
  4. To handle initial failures:
     a. they are likely to be systemic, ie, errors in stepsShared.prop
     b. if you fix property files, you need to change the individual FAILED step to READY or reset the whole test flow (see Resetting a test workflow)
     c. if you change the graph.xml file (eg, correct a step class name), you need to bld and then restart the controller
  5. What happens if I kill the engine while my test is running?
     a. steps that are running continue to run safely
     b. they successfully update the workflow engine database
     c. just restart your workflow!
  6. Instead of grepping the controller.log for FAILED steps, use this command (the log may contain old info):
     workflow -h my_workflow_home -s FAILED
  7. Once you fix a step, you can change its status from FAILED to READY with this command (use the full step name path):
     workflowstep -h my_workflow_home -p step_name ready
  8. If you made a fix that will correct a set of steps, you can set them all to ready by using a pattern for the step name.  Use % as a wildcard.  The following example finds any step with Nrdb anywhere in its full step name path:
     workflowstep -h my_workflow_home -p %Nrdb% ready
     It is ok if your pattern finds steps that are not FAILED.  The output of the workflowstep command will show you warnings like this, which you are free to ignore:
     Warning: Can't change PbergheiPostLoadGenome.genomeAnalysis.blastxGenomicSeqsNrdb from 'DONE' to 'READY'
  9. To find out if the workflow is still processing steps, or if it is blocked by FAILED steps, run this command to find ON_DECK steps.  (If there are none, then the workflow is stalled, and you must fix FAILED steps to give it steps to process.)
     workflow -h my_workflow_home -s ON_DECK
  10. If fixing a FAILED step involves correcting the step's graph XML, then you will need to restart the controller so it can pick up the new XML files (don't forget to do a bld first).  Just kill the controller and then restart it.
  11. If fixing a FAILED step involves correcting a step that the FAILED step depends on, then you will need to UNDO that depended-on step.
  12. how to run the controller in the background
  13. checking steps
  14. monitoring controller log

Resetting a test workflow

You can blow away your test workflow by resetting it.  Use this if there is an old workflow of this name in the home dir:

workflow -h workflow_home_dir -reset

Notes:
  1. you can reset a test workflow (but not a real one)
  2. CAUTION: this will WIPE OUT your test workflow's home dir (except config/) and the workflow tables
  3. you might need to manually clean up cluster dirs
  4. you might need to manually clean up download dirs

Testing email alerts

If you provide the file my_workflow_home/config/alerts (see the Configuration section above), the workflow will send email alerts when the specified steps are done.

Errors in this file will prevent the alerts from being sent.  If you want to add alerts to the file, test it first:

  1. make a temporary alerts file with the alert or alerts you want to test
  2. use the workflowAlertsTest command to test that file.  It will show you which steps will generate an alert
  3. append the tested alerts to the real alerts file in the config/ dir

Running a workflow

Running a workflow in real mode is similar to running in test mode.  The main difference is that the workflow does real work, rather than pretend to do it.  The operative differences are...

Load balancing

Coming soon…

Taking steps offline

To prevent a step from running, set it to OFFLINE.  While the workflow is running, use the workflowstep command:

 workflowstep -h workflow_home_dir -p stepname_pattern [offline|online]

Note this command has no effect when the workflow is not running.  In that case (before the workflow is run), add the step name to config/initOfflineSteps as specified above.

Handling step failure

Step failure will be logged in logs/controller.log and in step/[step_name]/step.err.

  1. Clean up.  Instructions may be found in step.err, such as:

    Since this plugin step FAILED, please CLEAN UP THE DATABASE by calling:
     ga GUS::Community::Plugin::Undo --plugin ClinEpiData::Load::Plugin::InsertInvestigations --workflowContext --commit --algInvocationId PLUGIN_ALG_INV_ID_HERE

     Find the number for “AlgInvocationId” in the log; if there is none, then it is probably safe to skip this cleanup.  Some steps do not require any cleanup.
  2. Manually correct the problem if necessary: a programming issue, file permissions, missing data, network issues, etc.
  3. You may need to undo a previous step before proceeding.  To do so, stop the workflow using the workflowStopController command.  Note there may be some processes started by the workflow that continue to run if they are not affected by the failed step.
  4. Set the step state to READY using the workflowstep command.  The workflow (if running) will run the step again.

Kill a running step

If you discover a problem in a step while it is running, it is usually preferable to let it finish or fail; alternatively, you can kill it and any of its child processes.

Undo

Coming soon...

Visualizing a workflow

While authoring a workflow simply involves editing the workflow XML files in a workflow directory, it may be handy to see the workflow steps and subgraphs in a graphical, linkable format along the way.  The Authoring GUI provides this functionality.  

To run the authoring GUI and see your workflow in graph format:

If you change your workflow graphs in $GUS_HOME, run workflowHtmlGenerator again and reload your browser page.

Port forwarding

This example uses port 8087.  However, you should probably use a different port, as 8087 might be in use.  Choose a port that you think will not be used by others.  

If you want to use a local browser (eg, on your laptop), you need to use port forwarding.  To do so, choose a port that is not in use on your local machine and forward it to the port you chose on the server.

Windows

Here is an example in SecureCRT (Windows machines):

UNIX/Mac

If you are using OS X or Linux, you can set up the port forwarding as a part of the ssh command you use to log in to the server. For example, if you wanted to forward port 8087 on cassini to your laptop, run

ssh -L 8087:localhost:8087 my_login@cassini.pcbi.upenn.edu

GUS dependencies or assumptions

Questions

Notes

DatasetLoader Steps

Steps in the workflow called resources acquire data from outside the workflow and make it available to other steps in the flow, either as files or in the database.  Or they may simply load the data into the database for use after the workflow is complete.  The design pattern for a resource step is:

Displaying Datasources on a web site

When the workflow loads a resource it records minimal information describing it:  only its name and version.  The <resource> element contains extensive provenance and descriptive information.  This is stored in the ApiDB.DataSource table (and its friends).  The information is placed there by the Tuning Manager, which reads the resources XML file from the repository.  It only includes information for resources that are found in SRes.ExternalDatabase.name.  This allows the resources XML file to accrue new resources after the flow is complete without those being seen on the website.

Testing the syntax of a resources XML file

Use the validateResourceXml program to compare your xml against the RNG schema definition ApiCommonData/Load/lib/rng/resources.rng.

Here is its usage:

$ validateResourceXml

usage: validateResourceXml -f resources_xml_file