1 of 36

2 of 36

CellProfiler for HCS data on the cloud

Anna Klemm and Nodar Gogoberidze

3 of 36

Free and open-source; Windows, Mac, Linux

Cited in 1,500+ papers per year

Used in 7/10 top pharma companies

In the Top 10 most popular papers in Genome Biology

Ranked most flexible and usable in independent analysis (Wiesmann et al.)

Anne Carpenter

Ray Jones

Lee Kamentsky

Allen Goodman

Claire McQuin

Beth Cimini

David Stirling

Alice Lucas

Nodar Gogoberidze

4 of 36

Software overview

Image analysis & quantification

Image-centric data analysis & machine learning

5 of 36

Software overview

Measure everything

Ask questions later

6 of 36

The CellProfiler interface

Pipeline panel

Settings panel

Module help

Start test mode

Set output folder

Start analysis run

Pipelines can be saved out as:

  • .cppipe or .json – text based, human readable (cppipe) and/or machine readable (json), best for sharing, using a pipeline in a new location
  • .cpproj – container file, contains the pipeline you made and the images that you loaded in. Best for resuming work on the machine you were already using.

7 of 36

The CellProfiler interface

8 of 36

The CellProfiler interface

The next module to run

- Will not execute

- Will execute

- Will pause during “Run”

- Won’t pause during “Run”

- Won’t show display

- Will show display

- Module set correctly

- Module has an error

- Module giving a warning

(such as “won’t run in test mode”)

Run until you hit a pause

Leave test mode

Run just the next module

Start over on the next image set

Launch the workspace viewer

Add, subtract, or reorder modules

9 of 36

The CellProfiler interface

Set what feeds into and out from every module

10 of 36

Module categories

  • File processing: Image input, file output

  • Image processing: Often used for pre-processing prior to object identification

  • Object processing: Identification, modification of objects of interest

  • Measurement: Collection of measurements from objects of interest

  • Data Tools: Measurement exploration, measurement output

  • Advanced: Typically modules for 3D analyses

  • Worm Toolbox: C. elegans-specific operations

Search modules for keywords

11 of 36

CellProfiler figure windows

  • The figure window has additional menu options

  • Toolbar menu: Home, pan, zoom in/out

  • CellProfiler Image Tools
    • Show pixel data (location, intensity)
    • Measure length between any two points just by clicking and dragging

12 of 36

Tips for creating a good high content analysis workflow

https://carpenter-singh-lab.broadinstitute.org/blog/when-to-say-good-enough

  • When finding the objects that you care about, ask yourself for your whole experiment:
      • Do I generally agree with most of the object segmentations from my analysis workflow?
      • Do I have an approximately equal number of regions/images where the threshold chosen by the algorithm for this image is a bit too low vs a bit too high?
      • Do I have an approximately equal number of oversegmentations/splits and undersegmentations/merges?
      • Very important: Do both the second and third bullet points hold true for both my negative control images and my positive control (or most extreme expected phenotype(s) sample) images?
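
The balance checks above can be tallied in a few lines once you have reviewed a sample of images by eye. This is only a sketch: the verdict labels, the group names, and the 25% tolerance are all assumptions chosen for illustration, not values from CellProfiler or the blog post.

```python
from collections import Counter


def tally_reviews(reviews):
    """Count manual review verdicts per control group.

    `reviews` is an iterable of (group, verdict) pairs, for example
    ("negative", "threshold_low"). All labels are hypothetical.
    """
    tallies = {}
    for group, verdict in reviews:
        tallies.setdefault(group, Counter())[verdict] += 1
    return tallies


def is_roughly_balanced(count_a, count_b, tolerance=0.25):
    """Return True when two error counts differ by at most `tolerance`
    of their combined total. The default tolerance is an arbitrary
    assumption; pick one that suits your experiment."""
    total = count_a + count_b
    if total == 0:
        return True
    return abs(count_a - count_b) / total <= tolerance


if __name__ == "__main__":
    reviews = [
        ("negative", "threshold_low"), ("negative", "threshold_high"),
        ("positive", "threshold_low"), ("positive", "threshold_low"),
        ("positive", "threshold_low"), ("positive", "threshold_high"),
    ]
    for group, counts in tally_reviews(reviews).items():
        balanced = is_roughly_balanced(counts["threshold_low"],
                                       counts["threshold_high"])
        print(group, dict(counts), "balanced" if balanced else "skewed")
```

Running the same tally separately for negative and positive controls is the point: a workflow that is balanced on one and skewed on the other can bias the comparison between them.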

13 of 36

“The thing I want to do doesn’t exist in CellProfiler!”

  • Are you sure it doesn’t?
    • Search the help, and/or post on image.sc
  • Is it an image processing utility that exists in ImageJ/Fiji?
    • Try out the RunImageJMacro module, which can point to your system ImageJ/Fiji
  • Go ahead and write your own!

14 of 36

Running large image sets in CellProfiler

  • A few – a few hundred images
    • Can likely run on your local machine
    • CellProfiler will automatically process images in parallel, using up to as many workers as you have CPUs
  • A few hundred – a few tens of thousands of images
    • Talk to your local sysadmin about running on a cluster (directly or with Docker)
    • Check out our instructions on getting started
  • A few tens of thousands – a few million images
    • Can consider cloud processing
    • Check out our Distributed-CellProfiler package for running on AWS

15 of 36

Batch files

  • Easy way to transition from running locally to running on the cluster
  • Data needs to have the same structure on the cluster as on your local machine; path mapping needs to be right
  • The CreateBatchFiles module creates a .h5 file you can move to the cluster to run

16 of 36

In practice, how do I run CellProfiler headlessly?

cellprofiler -c -r -p path/to/pipeline -o some/directory {INPUT} {GROUPINGS}

  • Executable call: cellprofiler
  • Headless flags: -c -r
  • Pipeline location: -p path/to/pipeline
  • Output folder: -o some/directory
  • Which files: {INPUT}
  • How to group: {GROUPINGS}

Even if you’re running this “wrapped” in a service somewhere, it’s important to know what information CellProfiler needs!

https://carpenter-singh-lab.broadinstitute.org/blog/getting-started-using-cellprofiler-command-line
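
As a minimal sketch, the six parts of the command can be assembled programmatically, which helps when a script or service generates many such commands. The function name and its parameters are hypothetical; only the flags shown on the slide (-c, -r, -p, -o) plus the -i, -f and -l flags covered on later slides are used.

```python
import shlex


def build_cellprofiler_command(pipeline, output_dir, input_args=(),
                               grouping_args=(),
                               executable=("cellprofiler",)):
    """Assemble the argument list in the same order as the slide.

    `executable` is a sequence so that multi-word calls such as
    ("python", "-m", "cellprofiler") also work.
    """
    command = list(executable)           # executable call
    command += ["-c", "-r"]              # headless flags
    command += ["-p", str(pipeline)]     # pipeline location
    command += ["-o", str(output_dir)]   # output folder
    command += list(input_args)          # which files ({INPUT})
    command += list(grouping_args)       # how to group ({GROUPINGS})
    return command


if __name__ == "__main__":
    cmd = build_cellprofiler_command(
        "path/to/pipeline",
        "some/directory",
        input_args=["-i", "path/to/folder"],
        grouping_args=["-f", "11", "-l", "20"],
    )
    # shlex.join quotes any argument that contains spaces
    print(shlex.join(cmd))
```

Keeping the command as a list until the last moment avoids quoting problems with paths that contain spaces; the list can also be passed directly to subprocess.run.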

17 of 36

In practice, how do I run CellProfiler headlessly?

cellprofiler -c -r -p path/to/pipeline -o some/directory {INPUT} {GROUPINGS}

  • Executable call –
    • If installed in Python – cellprofiler or python -m cellprofiler or python3 -m cellprofiler
    • Windows executable - C:\Users\UserName\ProgramFiles\CellProfiler\CellProfiler.exe
    • Mac executable - /Applications/CellProfiler.app/Contents/MacOS/cp
    • Executables can be dragged and dropped to terminal

18 of 36

In practice, how do I run CellProfiler headlessly?

cellprofiler -c -r -p path/to/pipeline -o some/directory {INPUT} {GROUPINGS}

  • Headless flags – always the same, no need to adjust

19 of 36

In practice, how do I run CellProfiler headlessly?

cellprofiler -c -r -p path/to/pipeline -o some/directory {INPUT} {GROUPINGS}

  • Pipeline location –
    • Can be a .cppipe file or a batch file created with CreateBatchFiles; a .cpproj file generally does not work well

20 of 36

In practice, how do I run CellProfiler headlessly?

cellprofiler -c -r -p path/to/pipeline -o some/directory {INPUT} {GROUPINGS}

  • Output folder –
    • Where you want your output to go
    • In your CellProfiler pipeline, ensure all exporting modules (SaveImages, SaveCroppedObjects, ExportToSpreadsheet, ExportToDatabase) are using “Default Output Folder” (or a subfolder of it) as their export location

21 of 36

Getting data into CellProfiler - Input Modules

  • 4 modules in total, which handle all the “bookkeeping” of what your experimental setup is
    • Images (mandatory) – tell CellProfiler which images you want to analyze
    • Metadata (optional if one field of view per file) – give CellProfiler metadata from the file header OR file name
    • NamesAndTypes (mandatory) – tell CellProfiler if 2D vs 3D, how to break down channels, any other bookkeeping
    • Groups (mandatory for tracking, Z projection, or whole-plate correction pipelines, recommended for cluster processing, otherwise optional) – tell CellProfiler if it is important to keep any image sets together during processing
  • See a great blog post about this, with links to a video tutorial, at broad.io/CellProfilerInput

22 of 36

Getting data into CellProfiler - LoadData

  • Create a CSV that instructs CellProfiler on how the images should be parsed – path and file name for each channel, any metadata you want included
  • You can add grouping and/or filtering to specific rows in the LoadData module settings
  • Handy if you’re comfortable scripting, and your data names are regularized!
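
A minimal sketch of scripting such a CSV is shown below. It relies on the LoadData column-naming convention (FileName_<channel>, PathName_<channel>, Metadata_<key>). The file-name pattern well_site_channel.tif, the channel names, and both function names are assumptions for illustration; adapt the parsing to your own naming scheme.

```python
import csv
from pathlib import Path


def build_load_data_rows(image_dir, channels):
    """Build one row per (well, site) image set.

    Assumes hypothetical file names such as A01_s1_DNA.tif, that is
    <well>_<site>_<channel>.tif. Files that do not match are skipped.
    """
    image_dir = Path(image_dir)
    image_sets = {}
    for path in sorted(image_dir.glob("*.tif")):
        parts = path.stem.split("_")
        if len(parts) != 3 or parts[2] not in channels:
            continue
        well, site, channel = parts
        row = image_sets.setdefault(
            (well, site), {"Metadata_Well": well, "Metadata_Site": site}
        )
        row[f"FileName_{channel}"] = path.name
        row[f"PathName_{channel}"] = str(image_dir)
    # Keep only image sets that have a file for every channel
    expected_columns = 2 + 2 * len(channels)
    return [row for _, row in sorted(image_sets.items())
            if len(row) == expected_columns]


def write_load_data_csv(rows, channels, csv_path):
    """Write the rows with metadata columns first, then each channel."""
    fieldnames = ["Metadata_Well", "Metadata_Site"]
    for channel in channels:
        fieldnames += [f"FileName_{channel}", f"PathName_{channel}"]
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Dropping incomplete image sets here is a design choice: it fails quietly, so you may prefer to log or raise on missing channels instead.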

23 of 36

In practice, how do I run CellProfiler headlessly?

cellprofiler -c -r -p path/to/pipeline -o some/directory {INPUT} {GROUPINGS}

  • Which files –
    • If you’re using LoadData
      • --data-file path/to/file.csv
    • If you’re using the Input modules and a .cppipe file:
      • Point to a folder on your cluster, run on all images there: -i path/to/folder
      • Pass in a text file listing images: --file-list path/to/file.txt
    • If you’re using the Input modules and a batch file:
      • Nothing needs to be entered here, it’s encoded in the batch file
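
For the --file-list option, the text file can be generated with a short script. This sketch assumes one image path per line and a hypothetical set of file extensions; the function name is illustrative only.

```python
from pathlib import Path


def write_file_list(image_dir, list_path,
                    extensions=(".tif", ".tiff", ".png")):
    """Write one absolute image path per line and return the count.

    The extension filter is an assumption; extend it to match the
    formats in your own data.
    """
    image_dir = Path(image_dir)
    paths = sorted(
        p.resolve()
        for p in image_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)
```

The resulting file is what you would pass as --file-list path/to/file.txt. Absolute paths are used so the list does not depend on the directory the job starts in, but they must still be valid on the machine that runs the job.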

24 of 36

In practice, how do I run CellProfiler headlessly?

cellprofiler -c -r -p path/to/pipeline -o some/directory {INPUT} {GROUPINGS}

  • How to group –
    • Some workflows (e.g. tracking, plate illumination correction) demand particular groupings (typically metadata-based)
    • Grouping otherwise allows parallelization – rather than a small number of CPUs each running thousands of files, thousands of CPUs each run a small number of files
    • Group by metadata: -g Metadata_Well=A01
    • Group by image set count: -f 11 -l 20
    • If using a batch file, you can get it to print all the groups present: --get-batch-commands-new
      • Add this to use -f/-l flags: --images-per-batch
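
Grouping by image set count amounts to cutting the numbered image sets into consecutive ranges, each handed to one job with -f and -l. A minimal sketch, with hypothetical function names and a batch size of 10 chosen to match the -f 11 -l 20 example:

```python
def first_last_batches(n_image_sets, batch_size):
    """Split image sets 1..n into consecutive (first, last) ranges.

    Image set numbers are 1-based and both ends are inclusive,
    matching the -f/-l example on the slide.
    """
    if n_image_sets < 0 or batch_size < 1:
        raise ValueError("need n_image_sets >= 0 and batch_size >= 1")
    return [
        (first, min(first + batch_size - 1, n_image_sets))
        for first in range(1, n_image_sets + 1, batch_size)
    ]


def grouping_flags(first, last):
    """Turn one (first, last) range into command-line arguments."""
    return ["-f", str(first), "-l", str(last)]


if __name__ == "__main__":
    for first, last in first_last_batches(25, 10):
        print(" ".join(grouping_flags(first, last)))
```

With 25 image sets and a batch size of 10 this gives three ranges: 1 to 10, 11 to 20, and a shorter final range of 21 to 25.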

25 of 36

Ok, I get the principles, how do I ACTUALLY do this?

26 of 36

Your local cluster

  1. Install CellProfiler on your cluster (4+ recommended; 3 and below run on Python 2, which is past end of life)
  2. Generate execution commands for the job in question (manually or using the flags demonstrated above)
  3. Put into your cluster’s submission system
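
Step 2 can be scripted. The sketch below writes one command per well using the -g Metadata_Well=... grouping shown earlier; how those commands are then submitted in step 3 depends entirely on your scheduler and is not covered. The function name and the per-well output subfolder are assumptions for illustration.

```python
import shlex


def commands_per_well(pipeline, output_root, wells, input_args=()):
    """Return one shell-ready CellProfiler command string per well.

    Each job writes to its own subfolder of `output_root` (an
    assumption, made so that parallel jobs do not overwrite each
    other's output files).
    """
    commands = []
    for well in wells:
        argv = [
            "cellprofiler", "-c", "-r",
            "-p", pipeline,
            "-o", f"{output_root}/{well}",
            *input_args,
            "-g", f"Metadata_Well={well}",
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    wells = ["A01", "A02", "A03"]  # hypothetical well names
    for line in commands_per_well("path/to/pipeline",
                                  "some/directory", wells):
        print(line)
```

The printed lines can be saved to a text file and fed to whatever array-job or task-list mechanism your cluster provides.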

  • Pros:
    • Local
    • Likely free to you
  • Cons:
    • Dependent on local bandwidth
    • Likely need IT support for setting up steps 1 and 3
      • Installation SHOULD be smooth, but…
    • Hard to support multiple CellProfiler versions

27 of 36

Containerization solves installation and version issues

  • Containerization: someone installs it once, you use their installation in a tiny OS “box” forever after
    • Reproducible!
    • Use it anywhere!*
    • You personally only have to install a program to run containers, and never anything else again!
    • Typically involves some code to run; many containers do not come with GUIs (or their GUIs can be painful to use)
    • Groups such as biocontainers have already made a LOT (>1000) of them
    • Developers tend to prefer Docker containers, sysadmins Singularity containers (but Singularity can run Docker containers)

https://biocontainers.pro/

28 of 36

Docker

  • Can be local (your own machine, your university cluster) or somewhere in the cloud
  • docker run \
        --volume=some/input/folder:/input \
        --volume=some/output/folder:/output \
        cellprofiler/cellprofiler:4.2.4 \
        cellprofiler -c -r -p path/to/pipeline -o some/directory {INPUT} {GROUPINGS}
    • First line tells Docker to run a container
    • Second line mounts the folder where your images are located
    • Third line mounts the folder where you want your output to go
    • Fourth line names the container image (you can use your own or other versions too)
    • Fifth line is the CellProfiler command, which you should already understand!
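
The five lines of the Docker command map onto a small helper like the one below. This is a sketch under assumptions: the function name is hypothetical, the in-container paths in the example (/input/pipeline.cppipe) are illustrative, and the CellProfiler arguments are passed through exactly as laid out on the slide without checking how the image's entrypoint treats them.

```python
import os
import shlex


def build_docker_command(input_dir, output_dir, cellprofiler_args,
                         image="cellprofiler/cellprofiler:4.2.4"):
    """Mirror the five lines of the slide as an argument list."""
    # Docker bind mounts expect absolute host paths
    host_input = os.path.abspath(input_dir)
    host_output = os.path.abspath(output_dir)
    return [
        "docker", "run",                    # line 1: run a container
        f"--volume={host_input}:/input",    # line 2: where the images are
        f"--volume={host_output}:/output",  # line 3: where output goes
        image,                              # line 4: which container image
        *cellprofiler_args,                 # line 5: the CellProfiler call
    ]


if __name__ == "__main__":
    cmd = build_docker_command(
        "some/input/folder",
        "some/output/folder",
        # Paths here are as seen from inside the container (hypothetical)
        ["cellprofiler", "-c", "-r",
         "-p", "/input/pipeline.cppipe", "-o", "/output"],
    )
    print(shlex.join(cmd))
```

Note that paths given to CellProfiler on the fifth line are resolved inside the container, so they should refer to the mounted /input and /output locations rather than to folders on the host.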

https://github.com/CellProfiler/distribution/blob/master/docker/Dockerfile

29 of 36

Galaxy

  • An easy-to-use (for end users) way to put a GUI onto an analysis, as well as make it shareable and reproducible
    • Developers need to create an XML file that “wraps” the analysis and tells Galaxy what type of input to expect, what type of output to expect, etc
  • Can run interactive tools such as Jupyter, etc
  • Many instances running on many physical pieces of hardware all over the world
  • CellProfiler can be run very simply in the Galaxy Imaging node – v3.1.9 and v4.2.1 ONLY, and only single threaded (no grouping flags) for now. In v3.1.9 you can build a pipeline from modules; v4.2.1 can only run premade .cppipe files.
  • Pros:
    • Easy to use
    • Likely free to you
    • Easy to share analyses, make them reproducible, etc
  • Cons:
    • Dependent on bandwidth of your Galaxy host
    • Creating a wrapper can be painful for new developers

https://imaging.usegalaxy.eu/

https://training.galaxyproject.org/training-material/topics/imaging/tutorials/object-tracking-using-cell-profiler/tutorial.html

30 of 36

Terra

  • Terra.bio – made by Broad Institute, Verily (Google), and Microsoft
  • Run analyses in Google Cloud, on data stored there OR in Azure OR in Amazon Web Services (AWS); can also be used to run Galaxy
  • Can run interactive tools such as Jupyter, or workflows by making a “wrapper” using WDL (Workflow Description Language)
  • Pros:
    • In the cloud, so bandwidth is never an issue
    • Lots of example workflows, especially in genomics
  • Cons:
    • Not free – though you can get $300 in credits
    • Current implementations may not support all grouping strategies, and only support .cppipe files
    • Another workflow language to learn!

https://imaging.usegalaxy.eu/

31 of 36

Distributed-CellProfiler

  • Run CellProfiler in the cloud on AWS
  • No need to know how to code, just edit a configuration file and execute pre-made scripts
  • Pros:
    • In the cloud, so bandwidth is never an issue
    • Just need to fill out a pre-made JSON file, no coding required
    • Extends out to non-CellProfiler projects with the rest of the DistributedScience universe
    • Made by the CellProfiler team, so good integration – supports batch files, grouping, etc
  • Cons:
    • Not free
    • Command line-only

https://github.com/DistributedScience

32 of 36

How can I learn how to do this stuff?

Where can I go for help?

33 of 36

forum.image.sc - Open scientific community forum for bioimage analysis and beyond

34 of 36

Center for Open Bioimage Analysis

Openbioimageanalysis.org

35 of 36

Gratitude

Recent major funding for this work provided by:

  • CZI Imaging Scientist Fellowship
  • NIH NIGMS: MIRA R35 GM122547
  • CZI Software Fellows program
  • NIH NIGMS: P41 GM135019

Many thanks to our

many biology collaborators

Beth Cimini

Mario Cruz

Barbara Diaz-Rohrer

Fernanda Fossa

Melissa Gillis

Nodar Gogoberidze

Serena Larew

Andréa Papaleo

Marine Secchi

Rebecca Senft

Callum Tromans-Coia

Erin Weisbart

Anne Carpenter

Shantanu Singh

John Arevalo

Niranj Chandrasekaran

Marzieh Haghighi

Yu Han

Alexander Kalinin

Serena Larew

Becki Ledford

Robert van Dijk

Cimini Lab members

Imaging Platform

Carpenter-Singh Lab members

36 of 36

Hands-on

  • Activity – run CellProfiler headless on your own machine
    • Can use the “Beginner Segmentation” images and “final” pipeline from tutorials.cellprofiler.org
    • Get it to run headlessly with the first and last flags, as well as the grouping flags. What must you do in CellProfiler to get those to work?
    • Optionally, install Docker on your machine, and try to do the same thing with the CellProfiler Docker container
  • Reminder, you can get these slides at broad.io/neubias23