Cache friendly job scheduling

Ilija Vukotic

University of Chicago

XRootD Workshop

11-12 June 2019

CC-IN2P3 Lyon

When & Why is it needed?

Not for an Analysis center/Tier3 local XCache. Those already have a set of users gravitating to them, and their cache hit rate should be substantial, unless your site is unlucky and you have a Minimum Bias physics group to support.

T1s and T2s get jobs based on the data they have. Thought experiment: imagine all non-tape storage is XCache. How do you schedule jobs if no one has a disk replica? Sending a job to whichever place has a free CPU guarantees that data is always moving around and rarely being found in a cache.

Cache aware scheduling

Cache efficiency in the current setup is limited by the size of the working data set and the fact that almost all sites accept all task types.

A further complication is reprocessing campaigns, which would flush caches with data not to be used again for months.

Several possible approaches:

  • Pinning datasets to sites. Can be done on dataset creation or first access.
  • Keeping track of the last place a dataset was used at.
  • RUCIO volatile storage elements.
  • ...


Virtual Placement model

On registration in RUCIO, every dataset gets assigned to N sites in the same cloud. The assignment is done randomly, where each site's probability of getting the dataset is proportional to the fraction of CPUs that the site contributes to ATLAS. Datasets are not actually copied to these N sites at all; they exist only in the “lake”.

PanDA would assign a job that needs this dataset as input to the first of these N sites. In case that site is in outage, the job would get assigned to the second site from the list. Once the job is there, it would access the data through the cache.
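A minimal sketch of that assignment and fallback logic, in Python. The site names, CPU shares and function names here are illustrative assumptions, not the actual RUCIO/PanDA implementation:

    import random

    def assign_virtual_sites(dataset_name, cpu_shares, n=3):
        """Pick n distinct virtual sites for a dataset; each site's chance is
        proportional to the fraction of CPUs it contributes (cpu_shares is a
        dict: site name -> CPU fraction)."""
        rng = random.Random(dataset_name)        # deterministic per dataset
        remaining = dict(cpu_shares)
        chosen = []
        for _ in range(min(n, len(remaining))):
            sites = list(remaining)
            weights = [remaining[s] for s in sites]
            pick = rng.choices(sites, weights=weights, k=1)[0]
            chosen.append(pick)
            del remaining[pick]                  # no duplicate sites
        return chosen

    def select_site(virtual_sites, site_is_up):
        """Send the job to the first assigned site that is not in outage."""
        for site in virtual_sites:
            if site_is_up(site):
                return site
        return None                              # all assigned sites are down

    # Illustrative only: three US-cloud sites with made-up CPU fractions
    vp = assign_virtual_sites("mc16_13TeV.example.DAOD",
                              {"BNL": 0.5, "MWT2": 0.3, "AGLT2": 0.2})
    print(vp, select_site(vp, lambda s: s != "BNL"))

Seeding by dataset name is just one way to make the assignment reproducible without storing it; the slide does not specify how the assignment would be persisted.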

This way we get:

  • potentially simpler scheduling
  • very high cache hit rate
  • the whole of the existing storage could be used as cache
  • reliability
  • adding/removing a site could easily be done from a central location
  • no need for FTS (except maybe between lakes) or for most RUCIO rules

Emulation of the whole grid over the last 5 months

Whatever we do, the required changes will be substantial, so we want to know in advance that it will work. To make sure of it, we need to emulate the different models at full-grid, 5-month scale. While simplifications have to be made, the emulation has to get all the scales correct.

Data needed:

To get it, read:

  • JEDI tasks table - filter (status: done, NOT site: supercomputer), get (taskid, creation time, end time, processing task)
  • Jobs table - filter (jobstatus: finished), get (number of jobs, input dataset, sums of job durations, actual cores used, input files)
  • RUCIO (input DS: files, DS size, data type)

  • Task creation & end time, input DS (name, size, nfiles), n jobs, avg. actual cores used, avg. wall time (see the record sketch after this list)
  • Clouds, sites per cloud, n cores per site
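For illustration, the per-task record and site topology the emulation consumes could look roughly like this; the field names are hypothetical, not the actual JEDI/RUCIO column names:

    from dataclasses import dataclass

    @dataclass
    class TaskRecord:
        """One finished task, as fed to the emulation (simplified)."""
        taskid: int
        created: float          # task creation time (epoch seconds)
        ended: float            # task end time (epoch seconds)
        input_dataset: str      # input DS name from RUCIO
        ds_size_bytes: int      # input DS size
        ds_nfiles: int          # number of files in the input DS
        n_jobs: int             # number of jobs in the task
        avg_cores: float        # average actual cores used per job
        avg_walltime_s: float   # average job wall time in seconds

    @dataclass
    class Site:
        name: str
        cloud: str              # cloud the site belongs to
        cores: int              # cores the site contributes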

Tested models

Model   SEs           Cache levels   Task types
V1      All / 2 / 3   single         prod
V2      All / 3       L2 cache       prod
V3      All / 2 / 3   single         All
V4      All / 3       L2 cache       All

All the code and data can be found HERE.

Constants across all the tests

  • Site-level cache size: 100TB of XCache per 1000 cores (see the sketch below)
  • If present, cloud-level cache sizes are 100TB per 2000 cores in the cloud
  • ⅓ Gbps per-job read rate - an order of magnitude too much ...
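A trivial sketch of those sizing rules, using the numbers quoted above; the core counts in the example are made up:

    def site_cache_tb(site_cores):
        # 100 TB of XCache per 1000 cores at the site
        return 100 * site_cores / 1000

    def cloud_cache_tb(cloud_cores):
        # if present: 100 TB of cloud-level cache per 2000 cores in the cloud
        return 100 * cloud_cores / 2000

    print(site_cache_tb(8000), cloud_cache_tb(20000))  # 800.0 TB and 1000.0 TB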

More in the reserve slides.

V3: ALL tasks, one Data Lake (has all the data), each site has XCache. Queue times for VP over all sites vs. 2 sites.

Curiously, more options means longer queue times!


V3: ALL tasks, one Data Lake (has all the data), each site has XCache. All sites vs. 2 sites.

More options for VP makes a site cache miss more probable and increases the Data Lake throughput needed. This much throughput can't be provided by a single site with tape.

V3: ALL tasks, one Data Lake (has all the data), each site has XCache. BNL cache and MWT2 cache shown.

Site-level cache ingress needs would be acceptable. As expected, egress (LAN) needs would be substantial.

Quality of scheduling

Two important measures:

  • Available resources are used as completely as possible.
  • Time To Complete (TTC) - defined as the time from task submission to task completion. We want the average TTC to be as short as possible. There are some time-sensitive tasks (usually analysis ones). Both measures are sketched below.
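Both measures are straightforward to compute from the emulation output; a minimal sketch, reusing the hypothetical TaskRecord fields from the data slide:

    def average_ttc(tasks):
        """Mean Time To Complete: task end time minus task submission time."""
        return sum(t.ended - t.created for t in tasks) / len(tasks)

    def cpu_utilization(tasks, total_cores, t_start, t_end):
        """Fraction of the available core-seconds actually used by jobs."""
        used = sum(t.n_jobs * t.avg_cores * t.avg_walltime_s for t in tasks)
        available = total_cores * (t_end - t_start)
        return used / available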

Improved scheduling: jobs needing no input go to unused cores.

Emulation slower: 9% longer TTC in one case, 2.1x longer TTC in the other.

Improved scheduling: jobs needing no input go to unused cores.

Very close result. Could go to 1 site, but that removes redundancy.

Conclusions

The VP model could work (optimally using CPU, disk & network), assuming XCache can deliver the performance (per node) and the origins are highly available.

It would be relatively straightforward to modify current systems to support it. It does not rely on any “smart” algorithms being developed.

It would significantly simplify ADC operations.

Future

  • Get VP scheduling in production at several sites
  • Run it for a few months, check against the simulation results
  • Run A/B tests and compare
    • Job failure rate
    • Job CPU efficiency
    • Bandwidth usage
  • Far future
    • Proper integration in RUCIO
    • Better-than-LRU caching algorithm (e.g. based on online reinforcement learning).

Reserve

Simulation I

  • It is dangerous to look at only 2-3 months. Strong seasonality and reprocessing campaigns change the results a lot.
  • Production job inputs are slightly more cacheable (52% of accesses and 67% of data volume) than analysis inputs (35% of accesses and 37% of data volume). So roughly only half of your data would come from the cache.
  • Different file types have very different access patterns (e.g. HITS, EVNT and payload files are very cacheable; DAODs, panda*, AODs less so).
  • A 2nd-level cache would not help, as most datasets exist at only one site (we have a replication factor of ~1.1).

Suppose we had a 100TB cache at MWT2 and ran all the same jobs we actually ran from Aug 2018 to Nov 2018. Keep in mind that these are not random jobs but jobs that already had data at MWT2; if we replayed all jobs, the results would be much worse.
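An estimate like this can be obtained by replaying the recorded file accesses through a simulated cache; a minimal LRU sketch (file names, sizes and the trace below are placeholders, and the actual emulation may use a different eviction policy):

    from collections import OrderedDict

    def replay_lru(accesses, capacity_bytes):
        """Replay (filename, size_bytes) accesses through an LRU cache and
        return the byte hit rate."""
        cache = OrderedDict()                    # filename -> size, LRU order
        used = hit_bytes = total_bytes = 0
        for name, size in accesses:
            total_bytes += size
            if name in cache:
                hit_bytes += size
                cache.move_to_end(name)          # mark as most recently used
                continue
            while cache and used + size > capacity_bytes:
                _, evicted_size = cache.popitem(last=False)   # evict LRU file
                used -= evicted_size
            if size <= capacity_bytes:
                cache[name] = size
                used += size
        return hit_bytes / total_bytes if total_bytes else 0.0

    # e.g. a 100 TB cache and a tiny made-up trace
    print(replay_lru([("f1", 2e9), ("f2", 3e9), ("f1", 2e9)], 100e12))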

V1: production tasks, one Data Lake (has all the data), each site has XCache. All sites vs. 3 sites vs. 2 sites.

More options for VP makes queue times shorter. Not much difference between 2 and 3 VPs.

V1: production tasks, one Data Lake (has all the data), each site has XCache. All sites vs. 3 sites vs. 2 sites.

Cache delivery rate is very acceptable in all options. More options for VP makes a site cache miss more probable.

V2: production tasks, one Data Lake (has all the data), each cloud has XCache (100TB/2k cores), each site has XCache. All sites vs. 3 sites.

Will a cloud-level cache reduce Data Lake throughput needs? Yes, it will!

What about Time To Complete (TTC)? Still the regular scheduling algorithm.

Emulation slower: 30% longer TTC in one case, 4.3x longer TTC in the other. Every time, a different cloud gets stuck with many more queued jobs?
