"Smarter" job scheduling using (user) provided infrastructure

Grant agreement 101057388

EuroScienceGateway

Leveraging the European compute infrastructures for data-intensive research guided by FAIR principles

Smarter job scheduling | Paul De Geest

Institutions involved

Motivation

National cloud and HPC infrastructures have been established, but they differ in:

Hardware
Configuration
Software stack
Authentication and authorization

Access is typically targeted at local researchers

Goal: provide efficient and structured access to data, tools and workflows, supported by suitable IT infrastructures.

The Pulsar Network

The Pulsar Network is a distributed job execution system that lets Galaxy instances scale their computing resources across heterogeneous, distributed compute facilities.

6 national Galaxy instances will make use of the Pulsar Network
At least 10 trusted Pulsar endpoints will route incoming jobs from Galaxy and other workflow management systems to local compute resources

User provided infrastructure

Bring Your Own Compute

Allow users to add new, external Pulsar endpoints
User preference form to add new Pulsar endpoint to Galaxy server
User preference form to select preferred Pulsar endpoint

User provided infrastructure

Bring Your Own Storage

Smart job-scheduling

How do we efficiently schedule jobs from any UseGalaxy.* server to any Pulsar endpoint in the Pulsar network or a user-defined compute endpoint?

Set up

[Diagram: set-up with the components TPV, Meta-scheduler, Job Visualization, BYOC and BYOS]

Time-series Database

Central database for collecting time-series information about all Pulsar destinations in the network:

Metrics per destination-tool pair, from the point of view of each Galaxy server with access to the Pulsar network
(Available) resource information from each Pulsar destination
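
To make the kind of record such a database would hold more concrete, here is a minimal sketch that serialises one destination-tool sample into InfluxDB line protocol (a backup slide names InfluxDB as the central store). The measurement name, tag names and field names are assumptions chosen for illustration, not the project's actual schema, and the measurement and key names are assumed to need no escaping.

```python
from datetime import datetime, timezone


def escape_tag(value: str) -> str:
    """Escape the characters that are special inside a line-protocol tag value."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line_protocol(measurement: str, tags: dict, fields: dict, when: datetime) -> str:
    """Build one line: measurement,tag=value field=value timestamp_in_ns."""
    tag_part = ",".join(f"{k}={escape_tag(str(v))}" for k, v in sorted(tags.items()))
    field_items = []
    for key, value in sorted(fields.items()):
        if isinstance(value, bool):
            field_items.append(f"{key}={'true' if value else 'false'}")
        elif isinstance(value, int):
            field_items.append(f"{key}={value}i")  # integers carry an 'i' suffix
        else:
            field_items.append(f"{key}={float(value)}")
    timestamp_ns = int(when.timestamp()) * 1_000_000_000  # second precision
    return f"{measurement},{tag_part} {','.join(field_items)} {timestamp_ns}"


# Hypothetical sample: one Galaxy server reporting on one destination-tool pair.
line = to_line_protocol(
    "destination_tool_metrics",
    {"galaxy_server": "usegalaxy.example", "destination": "pulsar_a", "tool_id": "bwa_mem"},
    {"queue_time_s": 42.5, "queued_jobs": 7},
    datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(line)
```

Keeping the Galaxy server, destination and tool as tags (rather than fields) is a design choice in this sketch: it allows grouping by any of the three when aggregating later.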

Current and aggregate stats

Median job count

Median destination-tool queue/run times

Median Pulsar Load
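
As an illustration of the aggregate statistics listed above, the following sketch computes median queue and run times per destination-tool pair from raw samples. The tuple layout and the sample values are assumptions made for the example; in practice these aggregates would more likely be computed by queries against the time-series database.

```python
from collections import defaultdict
from statistics import median


def median_times(samples):
    """Group (destination, tool_id, queue_s, run_s) samples and return medians per pair."""
    grouped = defaultdict(lambda: {"queue": [], "run": []})
    for destination, tool_id, queue_s, run_s in samples:
        grouped[(destination, tool_id)]["queue"].append(queue_s)
        grouped[(destination, tool_id)]["run"].append(run_s)
    return {
        pair: {
            "median_queue_s": median(times["queue"]),
            "median_run_s": median(times["run"]),
        }
        for pair, times in grouped.items()
    }


# Made-up samples for two hypothetical destinations.
samples = [
    ("pulsar_a", "bwa_mem", 10, 100),
    ("pulsar_a", "bwa_mem", 30, 300),
    ("pulsar_a", "bwa_mem", 20, 200),
    ("pulsar_b", "bwa_mem", 5, 50),
    ("pulsar_b", "bwa_mem", 15, 70),
]
stats = median_times(samples)
print(stats[("pulsar_a", "bwa_mem")])
```

The median is less sensitive than the mean to a single job that sat in a queue for hours, which is presumably why it is the statistic shown on the slide.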

Total Perspective Vortex

Initial matchmaking between the job requirements and the available compute endpoints
Access to the Galaxy database and other static information:

queue state of the Pulsar endpoints
dataset size, tool ID
object store and Pulsar endpoint geolocation
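
The initial matchmaking step can be pictured as a feasibility filter. The sketch below is plain Python and does not use TPV's actual configuration format or API; the `Job` and `Destination` classes and all their fields are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    """Hypothetical static description of a compute endpoint."""
    name: str
    max_cores: int
    max_mem_gb: float


@dataclass(frozen=True)
class Job:
    """Hypothetical job requirements."""
    tool_id: str
    cores: int
    mem_gb: float


def candidate_destinations(job, destinations):
    """Keep only the destinations that can satisfy the job's requirements."""
    return [
        d for d in destinations
        if d.max_cores >= job.cores and d.max_mem_gb >= job.mem_gb
    ]


destinations = [
    Destination("pulsar_small", max_cores=4, max_mem_gb=16.0),
    Destination("pulsar_big", max_cores=64, max_mem_gb=512.0),
]
big_job = Job("bwa_mem", cores=8, mem_gb=32.0)
print([d.name for d in candidate_destinations(big_job, destinations)])
```

The surviving candidates are what a later ranking step would order; this filter deliberately says nothing about which candidate is best.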

Meta-scheduling

Central API endpoint

Implements matchmaking logic, ranking the available Pulsar destinations based on:

current and historically collected (tool-destination) metrics
data locality (geolocation) of the object store and destination
dataset size

Currently: simple weighting of the different metrics

In progress: adding more advanced fuzzy- and adaptive-based matchmaking algorithms
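
A simple weighting of metrics could look like the sketch below. It is an assumption about what such a scheme might be, not the meta-scheduler's actual implementation: each metric is min-max normalised across the candidates, multiplied by a weight, and summed, with lower scores ranked first. The metric names, weights and values are made up, and dataset size is left out for brevity.

```python
def normalise(values):
    """Min-max normalise a list of numbers to [0, 1]; all-equal inputs map to 0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def rank_destinations(candidates, weights):
    """Order destination names by weighted sum of normalised metrics (lower is better)."""
    names = list(candidates)
    scores = {name: 0.0 for name in names}
    for metric, weight in weights.items():
        column = normalise([candidates[name][metric] for name in names])
        for name, value in zip(names, column):
            scores[name] += weight * value
    return sorted(names, key=lambda name: scores[name])


# Hypothetical per-destination metrics; every metric is "lower is better".
candidates = {
    "pulsar_a": {"median_queue_s": 600, "distance_km": 50, "load": 0.9},
    "pulsar_b": {"median_queue_s": 60, "distance_km": 800, "load": 0.2},
    "pulsar_c": {"median_queue_s": 300, "distance_km": 400, "load": 0.5},
}
weights = {"median_queue_s": 0.5, "distance_km": 0.3, "load": 0.2}
ranking = rank_destinations(candidates, weights)
print(ranking)  # ['pulsar_b', 'pulsar_c', 'pulsar_a']
```

With these weights the short queue at `pulsar_b` outweighs its distance from the data; raising the `distance_km` weight, for example for large datasets, would shift the ranking towards `pulsar_a`.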

Visualisation

Future Work

Enhancing data locality and performance by unifying multiple (user-provided) object stores into a single Onedata object store

Thanks!

All ESG members and specifically

Smart job-scheduling

Schedule jobs from any UseGalaxy.* server to any Pulsar endpoint in the Pulsar network or to a user-defined compute endpoint
Central InfluxDB to collect the current status and statistics of the available compute endpoints
Standalone API endpoint that processes these statistics using several algorithms to produce an ordered list of destinations:

Fuzzy-based matchmaking: compares job requirements with historically collected resource information, in addition to locality-based preemption
Adaptive-based matchmaking: compares the full historically collected resource information, in addition to locality-based preemption
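
The slides do not specify the fuzzy algorithm, so the following is only one possible reading, shown to illustrate the idea of a graded rather than yes/no match. Each requirement gets a degree of satisfaction between 0 and 1 from a linear ramp, and the degrees are combined with `min` (a common fuzzy AND). The function names and resource keys are hypothetical, and locality-based preemption is not modelled.

```python
def satisfaction(available, required):
    """Degree in [0, 1] to which `available` covers `required` (linear ramp)."""
    if required <= 0:
        return 1.0
    return max(0.0, min(1.0, available / required))


def fuzzy_match(job_requirements, historical_free):
    """Combine per-resource degrees with min, i.e. the weakest resource decides."""
    return min(
        satisfaction(historical_free[resource], needed)
        for resource, needed in job_requirements.items()
    )


# Hypothetical job and historically observed free resources per destination.
job_requirements = {"cores": 8, "mem_gb": 32}
history = {
    "pulsar_a": {"cores": 16, "mem_gb": 24},
    "pulsar_b": {"cores": 4, "mem_gb": 64},
}
degrees = {name: fuzzy_match(job_requirements, free) for name, free in history.items()}
print(degrees)  # {'pulsar_a': 0.75, 'pulsar_b': 0.5}
```

A destination that has historically had most, but not all, of the requested resources free still receives a non-zero degree, so it can be ranked below better matches instead of being excluded outright.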

Smart job-scheduling

To do: add a practical example of the workflow (TPV ranking function, example statistics we collect, the API, example algorithms in the API)

Smart job-scheduling

Data collection

Smart job-scheduling

Data locality

Extending Galaxy’s scheduling with:

Latitude and longitude for object stores
Latitude and longitude for compute destinations

Goal: enable allocation of workflows near the data
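
With latitude and longitude available for both object stores and compute destinations, proximity can be estimated with the haversine great-circle distance. The sketch below shows that calculation; the destination names and coordinates are made up, and using straight-line distance as a proxy for transfer cost is an assumption of this example, not something the slides state.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = radians(lat2 - lat1)
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_destination(object_store, destinations):
    """Return the name of the destination closest to the object store."""
    return min(
        destinations,
        key=lambda name: haversine_km(*object_store, *destinations[name]),
    )


# Hypothetical coordinates as (latitude, longitude).
object_store = (48.0, 7.85)
destinations = {
    "pulsar_de": (48.0, 7.8),
    "pulsar_it": (41.9, 12.5),
}
print(nearest_destination(object_store, destinations))  # pulsar_de
```

In a ranking scheme, the distance would typically be one input among several rather than the sole criterion, since a nearby destination with a long queue may still be the slower choice.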

Smart job-scheduling

Metascheduling

Analyzing the available options

Extend DIRAC, the EGI Workload Manager service in the High Throughput Computing grid
Implement a standalone API endpoint from scratch

Testing

Road-testing the DIRAC Pilot Factory
Evaluating two metascheduling algorithms with the InterGridSim simulator

Fuzzy-based matchmaking: compares job requirements with historically collected resource information, in addition to locality-based preemption
Adaptive-based matchmaking: compares the full historically collected resource information, in addition to locality-based preemption

Motivation

Galaxy can be deployed on a laptop

Motivation

Galaxy can be deployed on top of a large compute cluster

Bring Your Own Storage (BYOS)

Existing approach and way forward

Limitations of existing storage options

The storage needs to be provided by the admin
The service provider needs to sustain the storage
Storage is limited
Storage does not scale well with a growing number of users

Proposal

Empower users to bring their own storage

Bring Your Own Storage (BYOS)

Galaxy File Source Plugin added for Onedata

Bring Your Own Storage (BYOS)

Bring your own Object Storage via S3

Secrets stored in Galaxy's vault
This vault can be accessed by jobs (e.g. to push data back to an object store from a Pulsar endpoint)

Bring Your Own Storage (BYOS)

Use your own storage as the default

Bring Your Own Storage (BYOS)

Choose preferred storage per history, workflow and tool

Deliverable/Milestone	Due date	Verification Method	Progress Status (%)
D4.1 Bring Your Own Infrastructure (compute, storage) Demonstrator	31-Aug-2024	Report	50%
D4.2 Publication on the smart job scheduler implementation	28-Feb-2025	publication	20%
M4.1 BYOC and BYOS integrated into ESG	31-Aug-2023	Software available	100%
M4.2 Meta-scheduler model for job optimisation available	28-Feb-2024	Software available	30%

Collaborations & Future Work

Björn Grüning (ALU)

Collaborations & Future Work

ESG/Galaxy and related projects in previous EOSC projects

Collaborations & Future Work

Additional projects

EMBL/EBI BioModels “Run in Galaxy” button
OSCARS - Open Science Clusters’ Action for Research & Society
EMBL/EBI MGnify pipelines in Galaxy
EBP assembly and genome annotation

Collaborations & Future Work

The next year!

Galaxy Community Conference in the Czech Republic
Bring your own Storage (BYOS)
Bring your own Compute (BYOC)
A global network of resources and researchers sharing tools, workflows and data

Introduction and Overview

EuroScienceGateway will leverage a distributed computing network across 13 European countries, accessible via 6 national, user-friendly web portals. These portals facilitate access to compute and storage infrastructures across Europe, as well as to data, tools, workflows and services that can be customized to suit researchers’ needs. At the heart of the proposal, workflows will integrate with the EOSC-Core. Adopting, developing and implementing technologies that interoperate across services will allow researchers to produce high-quality FAIR data, available to all in EOSC. Communities across disciplines (Life Sciences, Climate and Biodiversity, Astrophysics, Materials Science) will demonstrate the bridge from EOSC's technical services to scientific analysis.
