Writing composable pipelines, using nesoni

http://bioinformatics.net.au/software.nesoni.shtml

https://github.com/Victorian-Bioinformatics-Consortium/nesoni

paul.harrison@monash.edu

I'm going to talk about a python library I've been working on. It's included in the VBC's swiss army knife of python bioinformatics tools, Nesoni, and trys to make the following sequence of development as smooth as possible:

1. A run-once script to analyse some specific data set.

2. A script that re-runs only from where it needs to
(eg because of a changed parameter).

3. A script that executes in parallel, and is more intelligent about what it needs to re-run

4. A reusable tool or tools with a command-line and python interface, that can be used by other researchers, that can give others a way to exactly reproduce the results of a paper, or apply it to their own data, and that can be used as a part of future scripts.

        

5. A collection of kitchen-sink tools, each with many options and modes, which only produce meaningful output when used in the correct sequence with the correct options.

You can move as far through the sequence as you need to. You don't have to do any work up front for things later in the sequence.

Nesoni command line

Some of you may have used nesoni from the command line. Here's a typical example of what this looks like:

nesoni clip: \               name of the tool

    --quality 15 \           a flag

    result \                 a positional parameter

    reads: s.txt             a section

Nesoni tools often produce several output files, which are either given a common prefix or placed in a directory.

nesoni clip: is a good example of a stage 5 tool in desperate need of breaking into more manageable components! Here's the help:

nesoni clip:
   --adaptors
   truseq-adapter,truseq-srna,genomic,multiplexing,pe,srna
   # Which adaptors to use. Comma seperated list from:
   # truseq-adapter,truseq-srna,genomic,multiplexing,pe,srna,dpnii,nlaiii
   # or "none".
   --match 10
   # Minimum length adaptor match.
   --max-errors 1
   # Maximum errors in adaptor match.
   # Note: slow for N > 1
   --clip-ambiguous yes
   # Clip ambiguous bases, eg N
   --quality 10
   # Quality cutoff
   --qoffset NNN
   # Quality character offset.
   # sanger: 33, solexa: 59, illumina: 64
   # Will guess if not given.
   --length 24
   # Reads shorter than this will be discarded.
   --homopolymers no
   # Set to yes to *remove* reads containing all the same base.
   --trim-start 0
   # Trim NNN bases from start of each read irrespective of quality.
   --trim-end 0
   # Trim NNN bases from end of each read irrespective of quality.
   --revcom no
   # Reverse complement all reads.
   # ie convert opp-out pairs to opp-in
   --fasta no
   # Output in fasta rather than fastq format.
   --gzip yes
   # Gzip output.
   --rejects no
   # Output rejected reads in separate file.
   prefix
   # Prefix for output files.
   reads:
   # Files containing unpaired reads.
   interleaved:
   # Files containing interleaved read pairs.
   pairs:
   # Pair of files containing read pairs.

Clip adaptors and low quality bases from Illumina reads or read pairs.

import nesoni

Tool classes

Each nesoni command line tool has a corresponding class (a subclass of nesoni.config.Action).

An instance of the class represents an invocation of the tool with the parameters that you want. It can be executed with the .run() method.

action = nesoni.Clip(prefix='result', quality=15, reads=['s.txt'])

action.run()

Parameters have the same name as on the command line, except dashes become underscores. You can either consult the command line help or look at help in pydoc.

   pydoc nesoni

or python -m pydoc nesoni

or pypy -m pydoc nesoni

The invocation of the tool is reified, it has been made into a thing. Why not just have a function that can be called directly? There had better be a payoff for this extra complication!

Customized tools

You can easily create a customized copy of a tool object, in exactly the same way as you create an instance of a tool class in the first place. This allows prototype-oriented programming.

For example, create a prototype object with parameters you want then derive copies to be applied to different files.

template_action = neosni.Clip(quality=15)

action = template_action(prefix='result', reads=['s.txt'])

(Another way to do this is to create a subclass of the Clip class.)

Later, we’ll see that by creating tools that accept a prototype object as a parameter, we can program in a style a C++ developer might call template metaprogramming, or induce a Java developer to talk about factories and inversion of control.

Make

Calling .run() will always run the tool. Calling .make() will run the tool if either its parameters have changed, or a tool needed to be “make”d earlier in the execution of the script. So using make means the script executes the tools from the first tool with changed parameters.

The parameters used in previous invocations of the script are stored in a directory called “.state” in the current directory. When developing tools, or if the contents of a file are changed, it is sometimes necessary to manually delete files from the “.state” directory.

If you structure your script as below, you’ll get some useful command line options to control making (any remaining command line arguments are passed to the script as parameters):

import nesoni
 
def my_script():

    ...
           
if __name__ == '__main__':
   nesoni.run_script(my_script)

From serial script to parallel script

nesoni uses the multiprocessing module to provide parallel execution.

Parallelizing a script has two advantages:

In a serial script, everything after the first remake needs to be remade. In a parallel script, only things that must execute after things that need remaking are remade.

Futures: Internally, nesoni uses the concepts of “futures”. When a future is created, a process is started to calculate its value. The future can later be accessed, which waits for the process to finish if necessary, then returns the value. The need to remake infects a process when the value of the future is accessed.

import nesoni

def fun():

    return (9**8**7)%6)

if __name__ == '__main__':

    f = nesoni.future(fun)

    # ... do other things ...

    print f()

Functions and parameters passed to futures need to be picklable. Make sure your script wraps any code to execute in "if __name__ == '__main__':", or it will fork bomb!

( There is also a lighter-weight version of future that runs the function in a thread rather than a separate process, called thread_future. This is much faster, and doesn't require picklability, but execution is limited to a single core by Python's GIL (Global Interpreter Lock). )

Depth-first execution: If there are as many processes running as there are cores available, the future runs immediately but the creator of the future waits for a core to be freed. Cores are made available to processes in last-in-first-out order. With a single core the program executes in depth first order, like a serial program would.

Stage: The Stage class provides a convenient way to run a set of parallel processes. It's like a stage on which your processes perform.


Example:

if __name__ == '__main__':

    stage = nesoni.Stage()

    samples = ['foo', 'bar', 'baz']

    for name in samples:

        nesoni.Clip(

            prefix=name,

            reads=[ 'input/'+name+'_reads.txt' ]

        ).process_make(stage)

    stage.barrier()

    # The results from clipping can now be safely used.

Complex dependency patterns: For most situation the Stage class should be sufficient to juggle your processes, but if you insist on doing something complex, note that futures are themselves picklable, and can be passed as parameters to other futures.

from nesoni import config

From once-off script to parameterized tool

Tools inherit from nesoni.config.Action:

from nesoni import config

class My_tool(config.Action):

    def ident(self):

        return 'my-tool'

    def run(self):

        ... do stuff ...

Slightly tricky point: The ident() method determines the name of the file created in the .state directory. It should include the name of the output file (or directory or prefix), in order for the make system to work. Nesoni tools generally also include the name of the tool, as there can be various different tools that operate on the one directory. See also the subclasses nesoni.config.Action_with_prefix and nesoni.config.Action_with_output_dir, below, which provide a sensible ident()for you.

Tools can have parameters. There are a few different ways I could have done this, in the end I settled on using class decorators. Give some help text too! :

@config.help(

    'A short description of my tool.',

    'A more detailed description of the tool.'

)

@config.Int_flag('param', 'Help text for the parameter.')

class My_tool(config.Action):

    param = 42

    def ident(self):

        return 'my-tool--%d' % self.param

    def run(self):

        ... do stuff, with self.param ...

Different kinds of parameters

@config.String_flag(name, help)

@config.Int_flag(name, help)

@config.Float_flag(name, help)

@config.Bool_flag(name, help)

@config.Positional(name, help)

@config.Section(name, help)

@config.Grouped_section(name, help)

@config.Main_section(name, help)

@config.Configurable_section(name, help)

You can easily create your own kinds of parameter, using the existing examples as a template.

Action classes

There are some base classes with interfaces in the nesoni idiom defined in nesoni.config.

Configurable

    Action

        Action_with_log

            Action_with_prefix

            Action_with_output_dir

            Action_with_working_dir

        Action_with_optional_input

        Action_with_optional_output

            Action_filter (inherits from both)

Configurable supports multiple inheritance, including inheritance of parameters using some metaclass magic.

Action_with_prefix, Action_with_working_dir, and Action_with_output_dir know where the output is going, and therefore provide a sensible default ident() method.

Instances of Action_with_log, while running, magically have a self.log attribute, an instance of nesoni.grace.Log. A log file will be created with a tool command-line parameters and execution time. You can also log information by selling self.log.log('my text\n').

Instances of Action_with_output_dir and Action_with_working_dir have a .get_workspace() method that produces an instance of nesoni.workspace.Workspace.

I'm unsure of the merit of Action_with_optional_input/output and Action_filter.

Running your tools on the command line

You can make a tool accessible from the command line with:

if __name__ == '__main__':

    nesoni.run_tool(My_tool)

If you have a collection of tools, use:

if __name__ == '__main__':

    nesoni.run_toolbox([ My_tool1, My_tool2, ... ])

If you are writing a package (oops, we've just gone from a once-off script to quite a large project!), instead of putting this under "if __name__ == '__main__':" you can put it in a file called __main__.py in the package directory. This makes it available via:

python -m my_package_name ... 

It's nice to also provide a script which simply consists of:

#!/usr/bin/env python

import my_package_name.__main__

Write the usual setup.py distutils script, and distribute.


from distutils.core import setup
import mypackage

setup(name='mypackage',
     version=mypackage.VERSION,      
     packages = [ 'mypackage' ],
     scripts = [ 'scripts/mypackage' ],
     classifiers = [
         'License :: OSI Approved :: GNU General Public License v2 or later (GPLv2+)',
     ],
     url = '...',
     author = '...',
     author_email = '...',
)

Historical artifacts and open questions

How complex should a tool be?

Nesoni has grown orgaincally. When I began the project Python was much slower than PyPy is now, and I didn't want too many tools to need to be used in an analysis. This encouraged me to write sprawling routines that conflated several concerns. For example alignment filtering, depth of coverage plots, and SNP calling were all handled by the same tool. This has now been broken up into two tools, which is still far from ideal.

With the new tool system, many of these tools could be broken down into lower level tools that handle single concerns. There would then be a collection of pipeline tools that operate at a somewhat higher level than the current tools.

It may still make sense to combine the execution of various tools. This does still have speed advantages, and requires less intermediate files. There are often trivial processing steps that would take no longer to recompute than to reload from disk. Python's support for co-routines, "generators", makes it possible to write cleanly separated code that executes together. I'm currently unsure how this would integrate into the system of composable tools I've just described.

Naming and organization of files

The working-directory system, with files expected to have certain names within a directory, is rather inflexible. On the other hand, it makes for straightforward use of several downstream tools. The directory provides a lot of automatic context, such as location of the reference sequence and annotations.

In other places in nesoni, there is no such abstraction where it seems there could be. For example, there is much duplicated code to let a tool take reads or a pair of files containing read pairs or a single file containing interleaved read pairs.

A system sometimes used elsewhere is to have each filename contain a complete history of how the file was created.

In general, often we are not working with single files so much as bundles of files and meta-information describing their provenance. It seems this should be reified into python objects, in much the same way that I've reified tool invocations as python objects here.

Wishlist