
"Smarter" job scheduling using (user) provided infrastructure

Grant agreement 101057388


EuroScienceGateway

Leveraging the European compute infrastructures for data-intensive research guided by FAIR principles


Smarter job scheduling | Paul De Geest


Institutions involved


Motivation


National cloud and HPC infrastructures have been established, but they differ in:

  • Hardware
  • Configuration
  • Software stack
  • Authentication and authorization

Access is typically targeted at local researchers.

Goal: provide efficient and structured access to data, tools and workflows, supported by suitable IT infrastructures.


The Pulsar Network


The Pulsar Network is a distributed job execution system that lets Galaxy instances scale their computing resources across heterogeneous and distributed compute facilities.

  • 6 national Galaxy instances will make use of the Pulsar Network
  • At least 10 Pulsar trusted endpoints, routing the incoming jobs from Galaxy and other workflow management systems to local compute resources


User provided infrastructure

Bring Your Own Compute

  • Allow users to add new, external Pulsar endpoints
  • User preference form to add new Pulsar endpoint to Galaxy server
  • User preference form to select preferred Pulsar endpoint


User provided infrastructure

Bring Your Own Storage


Smart job-scheduling


How do we efficiently schedule jobs from any UseGalaxy.* server to any Pulsar endpoint in the Pulsar network or a user-defined compute endpoint?


Set up


[Diagram components: TPV, Meta-scheduler, Job Visualization, BYOC, BYOS]


Time-series Database


Central database collecting time-series information about all Pulsar destinations in the network:

  • Metrics per destination and tool, from the point of view of each Galaxy server with access to the Pulsar Network
  • (Available) resource information from each of the Pulsar destinations
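
To make the kind of record concrete, here is a minimal sketch of one destination-tool sample serialised to InfluxDB line protocol (a later slide names InfluxDB as the central store). The measurement name, tag names and field names are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass


def escape_tag(value: str) -> str:
    """Escape commas, equals signs and spaces, which separate tags in line protocol."""
    for char in (",", "=", " "):
        value = value.replace(char, "\\" + char)
    return value


@dataclass(frozen=True)
class DestinationToolSample:
    # All attribute names below are hypothetical, chosen for illustration only.
    galaxy_server: str    # Galaxy server that observed the job
    destination: str      # Pulsar destination that ran the job
    tool_id: str          # tool that was executed
    queue_seconds: float  # time the job spent queued
    run_seconds: float    # time the job spent running
    timestamp_ns: int     # sample time in nanoseconds since the Unix epoch

    def to_line_protocol(self, measurement: str = "destination_tool") -> str:
        """Render as: measurement,tag=value,... field=value,... timestamp."""
        tags = ",".join(
            f"{key}={escape_tag(value)}"
            for key, value in (
                ("galaxy_server", self.galaxy_server),
                ("destination", self.destination),
                ("tool_id", self.tool_id),
            )
        )
        fields = (
            f"queue_seconds={float(self.queue_seconds)},"
            f"run_seconds={float(self.run_seconds)}"
        )
        return f"{measurement},{tags} {fields} {self.timestamp_ns}"
```

Tags identify the series (server, destination, tool) while fields carry the measured values, so per destination-tool aggregates can be queried directly.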


Current and aggregate stats


Median job count

Median destination-tool queue/run times

Median Pulsar Load
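
As an illustration of the aggregate statistics listed above, the following sketch computes the median queue time per destination-tool pair from raw samples. The tuple layout and function name are assumptions for this example; in practice the aggregation could equally be done inside the time-series database.

```python
from collections import defaultdict
from statistics import median


def median_queue_times(samples):
    """Return {(destination, tool_id): median queue time in seconds}.

    `samples` is an iterable of (destination, tool_id, queue_seconds)
    tuples; this layout is a hypothetical one chosen for the sketch.
    """
    grouped = defaultdict(list)
    for destination, tool_id, queue_seconds in samples:
        grouped[(destination, tool_id)].append(queue_seconds)
    return {key: median(values) for key, values in grouped.items()}
```

The median is less sensitive than the mean to a few jobs that were stuck in a queue for a very long time, which is one reason to prefer it for this kind of summary.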


Total Perspective Vortex


  • Initial matchmaking between the job requirements and the available compute endpoints
  • Access to the Galaxy database and other static information:
    • queue state of the Pulsar endpoints
    • dataset size, tool id
    • objectstore and pulsar endpoint geolocation
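
The initial matchmaking step can be pictured as a filter that keeps only endpoints able to satisfy a job's requirements. The sketch below is plain Python, not TPV configuration or TPV's API; all class, attribute and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRequirements:
    cores: int
    mem_gb: float


@dataclass(frozen=True)
class Endpoint:
    name: str
    max_cores: int        # largest core count a single job may request
    max_mem_gb: float     # largest memory a single job may request
    queued_jobs: int      # current queue length at the endpoint
    max_queued_jobs: int  # queue length above which the endpoint is skipped


def candidate_endpoints(job, endpoints):
    """Keep endpoints that can fit the job and whose queue is not full."""
    return [
        endpoint
        for endpoint in endpoints
        if endpoint.max_cores >= job.cores
        and endpoint.max_mem_gb >= job.mem_gb
        and endpoint.queued_jobs < endpoint.max_queued_jobs
    ]
```

The surviving candidates would then be handed to a ranking step; this filter only answers whether a job can run somewhere, not where it runs best.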


Meta-scheduling

Central API endpoint


Implements matchmaking logic, ranking the available Pulsar destinations based on:

  • current and historically collected (tool-destination) metrics
  • data locality (geolocation) of the objectstore/destination
  • dataset size

Currently: a simple weighting of the different metrics

Work in progress: adding more advanced fuzzy and adaptive matchmaking algorithms
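
A simple weighting of this kind could look like the sketch below: each candidate gets a cost built from normalised queue time, load and distance to the data, and the distance term counts more for larger datasets. The weights, the normalisation and the dataset-size scaling are assumptions made for illustration, not the meta-scheduler's actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    median_queue_s: float  # historical median queue time for this tool
    load: float            # current load of the destination, between 0 and 1
    distance_km: float     # distance between destination and object store


def _normalise(values):
    """Scale values into [0, 1] by dividing by the largest one."""
    top = max(values)
    return [value / top if top > 0 else 0.0 for value in values]


def rank_destinations(candidates, dataset_gb, w_queue=0.4, w_load=0.3,
                      w_distance=0.3, large_dataset_gb=10.0):
    """Return destination names ordered from lowest to highest cost."""
    if not candidates:
        return []
    queue = _normalise([c.median_queue_s for c in candidates])
    distance = _normalise([c.distance_km for c in candidates])
    # Assumption: data locality matters more as the dataset grows.
    size_factor = min(1.0, dataset_gb / large_dataset_gb)
    costs = [
        w_queue * q + w_load * c.load + w_distance * size_factor * d
        for c, q, d in zip(candidates, queue, distance)
    ]
    order = sorted(range(len(candidates)), key=lambda i: costs[i])
    return [candidates[i].name for i in order]
```

With these defaults, a small dataset tends to go to the least busy destination even if it is far away, while a large dataset tends to stay close to its object store.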


Visualisation


Future Work

Enhancing data locality and performance by unifying multiple (user) object stores into a single Onedata object store


Thanks!

All ESG members and specifically


Smart job-scheduling


  • Schedule jobs from any UseGalaxy.* server to any Pulsar endpoint in the Pulsar Network or to a user-defined compute endpoint
  • Central InfluxDB to collect the current status and statistics of the available compute endpoints
  • Standalone API endpoint that processes these statistics using several algorithms to produce an ordered list of destinations:
    • Fuzzy-based matchmaking comparing job requirements with historically collected resource information, in addition to locality-based preemption
    • Adaptive-based matchmaking comparing the full historically collected resource information, in addition to locality-based preemption
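
To give a feel for what fuzzy matchmaking means here, the sketch below turns "does this destination usually have enough free resources for the job?" into a degree between 0 and 1 instead of a yes/no answer. The membership function, its thresholds and all names are assumptions for illustration; the locality-based preemption part is not modelled, and this is not the algorithm evaluated in the project.

```python
def ramp(value, low, high):
    """Piecewise-linear membership: 0 at or below `low`, 1 at or above `high`."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def fuzzy_match(job_cores, job_mem_gb, median_free_cores, median_free_mem_gb):
    """Degree (0..1) to which historical free resources cover the job.

    A destination whose median free resources equal the requirement scores
    0.5; half the requirement or less scores 0; 1.5 times or more scores 1.
    The overall degree is the minimum of the two (fuzzy AND).
    """
    if job_cores <= 0 or job_mem_gb <= 0:
        raise ValueError("job requirements must be positive")
    cores_fit = ramp(median_free_cores / job_cores, 0.5, 1.5)
    mem_fit = ramp(median_free_mem_gb / job_mem_gb, 0.5, 1.5)
    return min(cores_fit, mem_fit)
```

Destinations can then be ordered by this degree, so a destination that only barely fits the job ranks below one with comfortable headroom rather than being treated as equivalent.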


Smart job-scheduling

To do: add a practical example of the workflow (TPV ranking function, example statistics we collect, API?, example algorithms in the API?)


Smart job-scheduling


Data collection


Smart job-scheduling

Data locality

  • Extending Galaxy’s scheduling with
    • Latitude and longitude for object stores
    • Latitude and longitude for compute destinations
  • Goal: enable allocation of workflows near the data
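
With latitude and longitude available for both object stores and compute destinations, "near the data" can be approximated by great-circle distance. The sketch below uses the standard haversine formula; the function names and the dictionary layout are assumptions for this example, and geographic distance is only a rough proxy for network transfer cost.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))


def nearest_destination(object_store, destinations):
    """Return the name of the destination closest to the object store.

    `object_store` is a (lat, lon) tuple and `destinations` maps a
    destination name to its (lat, lon) tuple (hypothetical layout).
    """
    return min(
        destinations,
        key=lambda name: haversine_km(*object_store, *destinations[name]),
    )
```

For example, `nearest_destination((48.0, 7.8), {"site_a": (50.8, 4.4), "site_b": (41.9, 12.5)})` picks the destination with the smaller distance to the object store's coordinates.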


Smart job-scheduling

Metascheduling

  • Analyzing the available options
    • Extend DIRAC, the EGI Workload Manager service in the High Throughput Computing grid
    • Implement a standalone API endpoint from scratch
  • Testing
    • Road testing the DIRAC Pilot Factory
    • Evaluating two metascheduling algorithms with the InterGridSim Simulator
      • Fuzzy-based matchmaking comparing job requirements with historically collected resource information, in addition to locality-based preemption
      • Adaptive-based matchmaking comparing the full historically collected resource information, in addition to locality-based preemption


Motivation

Galaxy can be deployed on a laptop


Motivation

Galaxy can be deployed on top of a large compute cluster


Bring Your Own Storage (BYOS)

Existing approach and way forward

  • Limitations of existing storage options
    • The storage needs to be provided by the admin
    • The service provider needs to sustain the storage
    • Storage is limited
    • Storage does not scale well with a growing number of users

  • Proposal
    • Empower users to bring their own storage


Bring Your Own Storage (BYOS)

Galaxy File Source Plugin added for Onedata


Bring Your Own Storage (BYOS)

Bring your own Object Storage via S3

  • Secrets stored in Galaxy's vault
  • This vault can be accessed by jobs (e.g. to push data back to an object store from a Pulsar endpoint)


Bring Your Own Storage (BYOS)

Set your own storage as the default


Bring Your Own Storage (BYOS)

Choose preferred storage per history, workflow and tool


Deliverable/Milestone                                               Due date     Verification method  Progress status (%)
D4.1 Bring Your Own Infrastructure (compute, storage) Demonstrator  31-Aug-2024  Report               50%
D4.2 Publication on the smart job scheduler implementation          28-Feb-2025  Publication          20%
M4.1 BYOC and BYOS integrated into ESG                              31-Aug-2023  Software available   100%
M4.2 Meta-scheduler model for job optimisation available            28-Feb-2024  Software available   30%


Collaborations & Future Work

Björn Grüning (ALU)


Collaborations & Future Work

ESG/Galaxy and related projects in previous EOSC projects


Collaborations & Future Work

Additional projects

  • EMBL/EBI BioModels “Run in Galaxy” button
  • OSCARS - Open Science Clusters’ Action for Research & Society
  • EMBL/EBI MGnify pipelines in Galaxy
  • EBP assembly and genome annotation


Collaborations & Future Work

The next year!

  • Galaxy Community Conference in the Czech Republic
  • Bring Your Own Storage (BYOS)
  • Bring Your Own Compute (BYOC)
  • A global network of resources and researchers sharing tools, workflows and data


Introduction and Overview

EuroScienceGateway will leverage a distributed computing network across 13 European countries, accessible via 6 national, user-friendly web portals, facilitating access to compute and storage infrastructures across Europe as well as to data, tools, workflows and services that can be customized to suit researchers’ needs. At the heart of the proposal, workflows will integrate with the EOSC-Core. Adoption, development and implementation of technologies to interoperate across services will allow researchers to produce high-quality FAIR data, available to all in EOSC. Communities across disciplines -- Life Sciences, Climate and Biodiversity, Astrophysics, Materials Science -- will demonstrate the bridge from EOSC's technical services to scientific analysis.
