1 of 18

Total Perspective Vortex (TPV): Seamlessly managing job destinations in Galaxy when users can’t be trusted with resources

Sanjay K. Srikakulam

Albert-Ludwigs-Universität Freiburg

2 of 18

So, what exactly is Galaxy and what can it offer?

  • An open-source platform for scientific data analysis
  • A graphical web interface
    • for running >9,000 tools
  • A FAIR data platform
  • A powerful workflow system
  • Data import/export from/to various services
  • Supports AAI
  • Galaxy: a research data management (RDM) platform
  • Galaxy: a virtual research environment (VRE)
  • A community
    • Notably extending beyond the bio space
    • Materials science, astronomy, and more
  • 18 years and thriving

3 of 18

Distributing analysis across computing resources

https://pulsar-network.readthedocs.io

  • Democratizing and federating data analysis

4 of 18

Galaxy ecosystem

TIaaS

BYOS

BYOC

5 of 18

  • Infrastructure and compute
    • deNBI Cloud
    • OpenStack + Terraform + Ansible + CI/CD + TIG stack
    • > 95K registered users

UseGalaxy.EU

  • Stats
    • 106 TiB of data per month
    • ~1.3M jobs per month
    • ~1500 new users per month
    • A job finishes every 3 seconds

6 of 18

Total Perspective Vortex (TPV) - Motivation

7 of 18

Total Perspective Vortex

  • A Python framework that dynamically schedules millions of Galaxy jobs according to configurable rules.
  • Utilizes a shared YAML file/database for configuration, incorporating resource allocation rules from community experts.
  • TPV can access extensive metadata, including user objects, provenance information, data locality, size, etc.
  • Distributes jobs to remote resources such as ARC and Pulsar, making intelligent, dynamic decisions for optimal job routing.
  • Enhances scalability and efficiency in federated resource environments, providing fine-grained control over job scheduling and simplifying administration.

8 of 18

TPV – Configuring resource requirements for a tool

tools:
  bowtie2.*:
    cores: 6
    mem: cores * 4
    gpus: 0
    env: []

destinations:
  slurm:
    cores: 24
    mem: 64
    gpus: 1
  arc01:
    cores: 8
    mem: 32
    gpus: 0

TPV will route to a destination where the job will fit
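The "where the job will fit" check can be sketched in plain Python. This is a minimal illustration of the idea, not TPV's actual implementation; the `fits` helper and the dictionaries are assumptions mirroring the config above.

```python
# Illustrative sketch (not TPV's real code): keep only destinations whose
# capacity covers the tool's resource request.

def fits(tool, dest):
    """True if the destination can satisfy the tool's cores/mem/gpus request."""
    return (tool["cores"] <= dest["cores"]
            and tool["mem"] <= dest["mem"]
            and tool["gpus"] <= dest["gpus"])

tool = {"cores": 6, "mem": 6 * 4, "gpus": 0}   # bowtie2.* from the config above
destinations = {
    "slurm": {"cores": 24, "mem": 64, "gpus": 1},
    "arc01": {"cores": 8, "mem": 32, "gpus": 0},
}

candidates = [name for name, dest in destinations.items() if fits(tool, dest)]
print(candidates)  # → ['slurm', 'arc01'] — both destinations can fit this job
```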

9 of 18

TPV – Intelligent resource selection

arc01 would have been a better fit because of the highmem tag, but it is currently marked offline, so slurm it is.

tools:
  bowtie2.*:
    cores: 12
    mem: cores * 4
    gpus: 0
    env: []
    scheduling:
      require: []
      prefer:
        - highmem
      accept:
      reject:
        - offline
    rules: []

destinations:
  slurm:
    cores: 16
    mem: 64
    gpus: 2
    scheduling:
      prefer:
        - general
  arc01:
    cores: 16
    mem: 64
    gpus: 0
    scheduling:
      prefer:
        - highmem
      reject:
        - offline
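The require/prefer/accept/reject matching can be approximated with set operations. This is a simplified sketch of the idea, not TPV's actual matching algorithm; `schedulable` and `rank_by_preference` are hypothetical names.

```python
# Hedged sketch of TPV-style tag scheduling (simplified): drop destinations
# that lack a required tag or carry a rejected tag, then sort so destinations
# matching more "prefer" tags come first.

def schedulable(tool_sched, dest_tags):
    if set(tool_sched.get("require", [])) - set(dest_tags):
        return False                      # missing a required tag
    if set(tool_sched.get("reject", [])) & set(dest_tags):
        return False                      # carries a rejected tag
    return True

def rank_by_preference(tool_sched, destinations):
    prefer = set(tool_sched.get("prefer", []))
    ok = {n: tags for n, tags in destinations.items()
          if schedulable(tool_sched, tags)}
    # destinations sharing more preferred tags sort first
    return sorted(ok, key=lambda n: -len(prefer & set(ok[n])))

tool_sched = {"prefer": ["highmem"], "reject": ["offline"]}
destinations = {
    "slurm": ["general"],
    "arc01": ["highmem", "offline"],      # currently marked offline
}
print(rank_by_preference(tool_sched, destinations))  # → ['slurm']
```

As the slide notes, arc01 would win on the `highmem` preference, but its `offline` tag eliminates it outright, leaving slurm.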

10 of 18

TPV – Routing rules and inheritance

tools:
  bowtie2:
    cores: 4
    mem: 16
    rules:
      - if: input_size > 10
        cores: 16
        mem: 32
      - if: input_size >= 20 and input_size <= 50
        scheduling:
          require:
            - highmem
      - if: input_size >= 55
        fail: Size (input_size) is too large

  • Embedded Python expressions
  • Conditional tags
  • Contextualized errors
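Rule evaluation can be sketched as overlaying each matching rule's settings onto the tool's defaults. This is an illustration of the concept only; TPV compiles its expressions safely, whereas the bare `eval` below (and the `{input_size}` placeholder syntax) are assumptions for demonstration.

```python
# Illustrative evaluation of TPV-style rules (simplified): each rule's "if"
# expression is evaluated against the job context; matching rules overlay
# their settings, and a "fail" rule aborts with a contextualized message.

rules = [
    {"if": "input_size > 10", "cores": 16, "mem": 32},
    {"if": "input_size >= 20 and input_size <= 50",
     "scheduling": {"require": ["highmem"]}},
    {"if": "input_size >= 55",
     "fail": "Size {input_size} is too large"},
]

def apply_rules(base, rules, context):
    entity = dict(base)
    for rule in rules:
        # NOTE: eval is for illustration only; TPV evaluates expressions itself
        if eval(rule["if"], {}, context):
            if "fail" in rule:
                raise RuntimeError(rule["fail"].format(**context))
            entity.update({k: v for k, v in rule.items() if k != "if"})
    return entity

print(apply_rules({"cores": 4, "mem": 16}, rules, {"input_size": 30}))
# → {'cores': 16, 'mem': 32, 'scheduling': {'require': ['highmem']}}
```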

11 of 18

TPV – Routing rules and inheritance

tools:
  bowtie2:
    cores: 4
    mem: 16
    rules:
      - if: input_size > 10
        cores: 16
        mem: 32
      - if: input_size >= 20 and input_size <= 50
        scheduling:
          require:
            - highmem
      - if: input_size >= 55
        fail: Size (input_size) is too large

tools:
  bwa_mem.*:
    inherits: bowtie2
    cores: 16
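The effect of `inherits` is that a child entry starts from its parent's settings and overrides selectively. A minimal sketch of that merge, assuming simple flat dictionaries (TPV's real resolution handles nesting and defaults):

```python
# Minimal sketch of TPV-style inheritance: "bwa_mem.*" inherits bowtie2's
# settings and overrides only cores.

tools = {
    "bowtie2": {"cores": 4, "mem": 16},
    "bwa_mem.*": {"inherits": "bowtie2", "cores": 16},
}

def resolve(name, tools):
    entry = dict(tools[name])
    parent = entry.pop("inherits", None)
    if parent:
        merged = resolve(parent, tools)   # resolve the parent first
        merged.update(entry)              # child keys win over parent keys
        return merged
    return entry

print(resolve("bwa_mem.*", tools))  # → {'cores': 16, 'mem': 16}
```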

12 of 18

TPV – Metascheduling

global:
  default_inherits: default

tools:
  default:
    cores: 1
    mem: cores * 3.8
    gpus: 0
    env: []
    params: []
    scheduling:
      reject:
        - offline
    rules: []
    rank: |
      final_destinations = helpers.weighted_random_sampling(candidate_destinations)
      final_destinations

Custom rank functions, written in Python, enable advanced metascheduling by selecting the best destination from the candidate list; if no rank function is provided, TPV defaults to the most preferred destination.
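The idea behind the `helpers.weighted_random_sampling` call above can be sketched as repeated weighted draws without replacement, so larger (or less loaded) destinations are picked first more often. This standalone version and its `weights` parameter are assumptions for illustration, not TPV's implementation.

```python
import random

# Hedged sketch of weighted-random ranking: reorder candidate destinations
# by repeated weighted draws without replacement.

def weighted_random_sampling(candidates, weights, rng=random):
    """Return candidates reordered by successive weighted random picks."""
    pool, w = list(candidates), list(weights)
    ordered = []
    while pool:
        pick = rng.choices(range(len(pool)), weights=w, k=1)[0]
        ordered.append(pool.pop(pick))
        w.pop(pick)
    return ordered

random.seed(0)
# e.g. weight destinations by core count: slurm (24 cores) vs arc01 (8 cores)
print(weighted_random_sampling(["slurm", "arc01"], [24, 8]))
```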

13 of 18

TPV – Current developments

[Diagram: jobs send job data to TPV; destinations report usage stats and metrics; TPV combines these with additional metadata and returns a best-destination recommendation.]

  • Optimize job scheduling with data proximity
  • Smart resource utilization
  • Edge intelligence at your fingertips

14 of 18

Summary

  • A flexible system that makes smart, dynamic decisions to optimize job routing and resource utilization.
  • Utilizes a shared YAML file/database for resource allocation rules, reducing admin overhead.
  • Collection of rules contributed by community experts, minimizing configuration repetitiveness.
  • Adjusts resources based on metadata, tools, and other criteria.
  • Capable of distributing jobs to remote resources such as ARC and Pulsar.
  • Enhances the user experience.

15 of 18

Thank you!

16 of 18

17 of 18

Object stores and Bring Your Own Storage

  • Object stores
      • User level
      • History level
      • Workflow level
      • Job level
  • Well annotated, classified, and visually explained
  • Bring your own storage
      • All data relevant to your job will be stored only on your storage
      • Vault to store your credentials
      • Secured data analysis is one of the use cases

18 of 18

Bring Your Own Compute

  • Deploy a Pulsar instance on your favorite cloud or your local computing infrastructure
    • Detailed documentation on how to, including recipes, is already available at https://github.com/usegalaxy-eu/pulsar-deployment
  • Connect it to Galaxy in a few simple steps
  • Use Galaxy to submit jobs to your compute infrastructure
  • Less maintenance and administration
  • Leverage the vast number of tools, workflows, and materials