Using XCaches

Ilija Vukotic, University of Chicago

XRootD Workshop

11-12 June 2019

CC-IN2P3 Lyon

Current ATLAS distributed data management

  • Almost all ATLAS computing sites (~130) have local storage (DDM endpoint).*
  • Data distribution among DDM endpoints is orchestrated by RUCIO. The major factor in placement decisions is the current distribution of available storage space.
  • Data is transported to DDM endpoints using FTS.
  • Jobs get scheduled only at computing centers that have the input data.**
  • The WFMS (workflow management system) is mostly unaware of the network.

2

* The exceptions are two sites that are actually composed of several nearby sites, only one of which has storage.

** The only exception is the special Hospital queues, which run <0.5% of jobs.

ATLAS use cases

  • Grid jobs
    • Production - centrally managed tasks, frequently composed of hundreds of thousands of jobs.
    • Analysis - user-submitted grid jobs, ranging from several jobs running over GBs of data to 100k jobs running over PBs of data.
  • Local batch jobs
    • Analysis centers and universities have batch queues used by physicists for end-stage analysis. Fast turnaround, very spiky loads. Usually backed by local storage that is hard to manage and fair-share.

3

Future

The working dataset will grow in size, unlike “hot” disk capacity.

Data Lake - fewer, bigger storage sites.

Higher usage of colder storage.

Reduce manpower needed while maintaining or increasing utilization.

New use cases on the horizon: low latency, high throughput, sub-file access, intelligent data delivery services (ESS/ServiceX/iDDS/...).

If and how caches will be used is still a matter of debate, but they are deemed an essential part of future DDM.

4

Keys to successfully using (X)Cache*

  • Appropriate hardware is deployed (storage space and performance).
  • All origin servers are highly available and performant.
  • Very little need for manual interventions and/or cache wipes.
  • Ensure a high cache hit rate
    • Cache-aware data placement
    • Potentially better-than-LRU caching algorithms (e.g. based on filenames); a sketch follows below.
  • Deployed only at places that demonstrably benefit from it
    • Reduced throughput on the WAN
    • Shorter job wall-time, shorter task time-to-completion
    • Reduction of storage, FTS, and RUCIO support workload
  • Any resource that affects more than local operation must be centrally managed

5

* Assuming stable and performant operation of the cache itself.
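To make the "better than LRU" idea concrete, here is a minimal sketch of a filename-aware retention score. Everything in it is hypothetical (the name hints, the weights, the scoring formula); a real implementation would have to hook into XCache's own purge logic rather than run as a standalone script.

    import os
    import time

    # Hypothetical filename hints: reprocessing output is assumed to be read once,
    # so it should be evicted before popular analysis formats of the same age.
    NAME_HINTS = {
        "recon": 0.2,   # reprocessing output, unlikely to be re-read soon
        "DAOD": 1.0,    # derived analysis format, likely to be re-read
    }

    def retention_score(path, now=None):
        """Higher score = keep longer. Combines recency (as plain LRU does)
        with a filename-based popularity hint."""
        now = now or time.time()
        age_hours = (now - os.path.getatime(path)) / 3600.0
        hint = 0.5                                  # default weight for unknown names
        for token, weight in NAME_HINTS.items():
            if token in os.path.basename(path):
                hint = weight
                break
        return hint / (1.0 + age_hours)             # plain LRU would use 1 / (1 + age_hours)

    def files_to_evict(paths, bytes_to_free):
        """Pick the lowest-score files until enough space is freed."""
        freed, victims = 0, []
        for p in sorted(paths, key=retention_score):
            if freed >= bytes_to_free:
                break
            victims.append(p)
            freed += os.path.getsize(p)
        return victims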

Cache aware scheduling

Cache efficiency in the current setup is limited by the size of the working data set and by the fact that almost all sites accept all task types.

A further complication is reprocessing campaigns, which would flush the caches with data not to be used again for months.

Several possible approaches (the first two are sketched below):

  • Pinning datasets to sites. Can be done on dataset creation or on first access.
  • Keeping track of the last place a dataset was used.
  • RUCIO volatile storage elements.
  • ...
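A minimal sketch of the first two approaches (pinning on first access, remembering where a dataset was last used). The lookup tables and function names are hypothetical; in practice this state would live in Rucio/AGIS and the logic in the WFMS brokering code.

    # Hypothetical in-memory stand-ins for a Rucio/AGIS-backed lookup.
    dataset_pin = {}        # dataset -> site it was pinned to on first access
    dataset_last_site = {}  # dataset -> last site that processed it

    def broker_site(dataset, candidate_sites):
        """Prefer the pinned site, then the last site the dataset was used at,
        so repeated tasks over the same dataset hit a warm cache."""
        if dataset_pin.get(dataset) in candidate_sites:
            return dataset_pin[dataset]
        if dataset_last_site.get(dataset) in candidate_sites:
            return dataset_last_site[dataset]
        site = candidate_sites[0]                 # fall back to the usual brokering choice
        dataset_pin.setdefault(dataset, site)     # pin on first access
        return site

    def record_usage(dataset, site):
        dataset_last_site[dataset] = site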

6

Central management

Previous experience has taught us that our software changes far too quickly for large, locally managed distributed deployments.

The solution is to have a small group of service experts who fully manage the deployments.

With the rise of Kubernetes, and platforms like SLATE, that is easily done.

Initial setup of the k8s and SLATE platform is quite straightforward, and much easier than setting up, configuring and managing even just one service (e.g. XCache). With more services deployed in this way, the cost of k8s and SLATE is further amortized.

While ATLAS will have some locally managed caches, most of them will be centrally managed.

7

Current XCaches - centrally managed

Deployed using SLATE at AGLT2 and MWT2.

Using SLATE-recommended hardware (HDD).

SWT2 & NET2 still waiting for hardware.

Storage in JBOD configuration. No pre-caching or prefetching. LRU clean-up model.

Used for the special “Hospital” queues. Low stress level (up to 250 cores). All accesses are done by the Pilot, which does a simple xrdcp and checks adler32 checksums (a sketch follows below). Job scheduling is completely cache-unaware. It takes weeks to fill up and months for the XCache content to reach a steady state.

A limited set of origins, configured in AGIS (currently BNL, TRIUMF, Victoria, AGLT2, MWT2).
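For illustration, a rough sketch of the Pilot-style access: copy the file through the cache with xrdcp and verify the adler32 checksum. The cache host, origin URL and file path are placeholders, and the root://cache//root://origin/... double-URL form is assumed here as the forwarding convention; the actual prefixing is whatever the pilot/AGIS configuration defines.

    import subprocess
    import zlib

    def adler32(path, chunk=1024 * 1024):
        """Compute the adler32 checksum as an 8-digit hex string."""
        value = 1
        with open(path, "rb") as f:
            while True:
                data = f.read(chunk)
                if not data:
                    break
                value = zlib.adler32(data, value)
        return "%08x" % (value & 0xFFFFFFFF)

    def fetch_through_cache(cache, origin_url, dest, expected_adler32):
        """xrdcp through the cache, then verify the checksum against the catalog value."""
        url = "root://%s//%s" % (cache, origin_url)     # assumed forwarding-style prefixing
        subprocess.check_call(["xrdcp", "--nopbar", url, dest])
        if adler32(dest) != expected_adler32:
            raise RuntimeError("adler32 mismatch for %s" % dest)

    # Hypothetical usage:
    # fetch_through_cache("xcache.example.org:1094",
    #                     "root://origin.example.org:1094//atlas/rucio/...",
    #                     "/tmp/input.root", "0a1b2c3d")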

8

Performance

Stable. Discounting SWT2, bad transfer checksums are at the per-mille level.

Since the job scheduling is cache-unaware, the cache hit rate is almost negligible.

Unexpectedly, the load is quite spiky.

With 2 write-queue (WQ) threads and 10 WQ blocks per thread, cached-file sparseness is at the 1% level. It could be very different in realistic cache operation, where we expect ~2 times more reads than writes to disk.

9

Current XCaches - centrally managed

ESnet node in Sunnyvale.

24 TB of SSDs on a high-performance node with a 100 Gbps NIC.

Only stress-tested so far. Sustained 17 Gbps ingress from US ATLAS sites and 20 Gbps egress to Google Cloud clients. It could probably do more; one limitation is that GCE nodes are limited to 1440-byte frames.

10

Current XCaches - privately managed

Tier2 @ Birmingham

  • “Diskless” site, relying on storage at the nearby Manchester Tier2. Hoping to reduce bandwidth needs and keep things running when the connection goes down.
  • It took a few days to configure XCache; two months later, still no jobs have been processed using the cache. Getting everything set up correctly in Rucio & AGIS takes time and experience.

Analysis centers (BNL, SLAC)

  • To speed up user analysis and provide management-free storage.
  • 50 TB of NVMe storage, 80 Gbps NIC. Managed to deliver 50 Gbps to stress-test clients running in Google Cloud.

Privately managed instances have no automatic AGIS (de)activation and are not included in central monitoring.

11

Monitoring XCache usefulness

We are interested in:

  • Job failure rate due to input data access. We want it to be less than or equal to that of “regular” jobs.
  • CPU efficiency: jobs with input data in the cache should have equal or better CPU efficiency; if not, the CPU-efficiency penalty should be < 10%.
  • A meaningful decrease in WAN bandwidth utilization.

While we collect all of this data, XCache usefulness can only really be determined once the WFMS schedules jobs in an XCache-aware way, followed by an A/B test (sketched below).
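A minimal sketch of such an A/B comparison, assuming job records with hypothetical fields (via_cache, failed_input, cputime, walltime, cores, wan_bytes); the real numbers would come from the PanDA/pilot monitoring streams.

    def summarize(jobs):
        """Aggregate the three metrics of interest over a list of job records (dicts)."""
        n = float(len(jobs))
        failure_rate = sum(j["failed_input"] for j in jobs) / n
        cpu_eff = sum(j["cputime"] / (j["walltime"] * j["cores"]) for j in jobs) / n
        wan_tb = sum(j["wan_bytes"] for j in jobs) / 1e12
        return failure_rate, cpu_eff, wan_tb

    def ab_compare(jobs):
        """Split one task's jobs into a cache arm (A) and a direct-access arm (B)."""
        a = [j for j in jobs if j["via_cache"]]
        b = [j for j in jobs if not j["via_cache"]]
        fa, ca, wa = summarize(a)
        fb, cb, wb = summarize(b)
        print("failure rate  : %.3f (cache) vs %.3f (direct)" % (fa, fb))
        print("CPU efficiency: %.2f vs %.2f (penalty %.1f%%)" % (ca, cb, 100.0 * (cb - ca) / cb))
        print("WAN traffic   : %.2f TB vs %.2f TB" % (wa, wb))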

12

Near future

Cache-aware brokering (VP).

Adding more sites (SWT2 & NET2).

Restart automatic validation of origin servers.

Gradually increase the percentage of jobs using XCache.

A/B testing.

Long term:

Rucio-integrated VP.

Multi-node XCache sites.

13

RESERVE slides

14

A way to do it

  • Smaller sites will go storageless
    • If well connected to a nearby site.
    • If the RTT is large or the bandwidth insufficient, place an XCache at the endpoint.
  • Larger sites move part of their storage to XCache storage, slowly increasing the XCache proportion until all non-lake storage is XCache. This is still an idea that has not been tested in real life; it takes a lot of time and effort to change a system that took a decade to build.

15

VP - Virtual Placement

16

  • On registration in RUCIO, every dataset gets assigned to N sites in the same cloud.
  • Assignment is done randomly, where each site's probability of getting the dataset is proportional to the fraction of CPUs that the site contributes to ATLAS.
  • Datasets are not actually copied to these sites at all; they only exist in the “lake”.
  • PanDA would assign a job that needs this dataset as input to the first site on the list. If that site is in outage, the job would be assigned to the second site on the list. Once the job is there, it accesses the data through the cache (a sketch follows below).

This way we get:

  • Simplified scheduling
  • Very high cache hit rate
  • We could use a large fraction of the existing storage as caches
  • Reliability
  • Adding/removing a site is easily done from a central location
  • Less stress on FTS, fewer RUCIO rules (neither is needed at XCache-only sites)
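A minimal sketch of the assignment and brokering rules above. The site names and CPU shares are made up, and a real implementation would live in RUCIO (assignment) and PanDA (brokering).

    import random

    # Hypothetical CPU contributions (cores) of the sites in one cloud.
    CLOUD_SITES = {"SiteA": 10000, "SiteB": 6000, "SiteC": 3000, "SiteD": 1000}

    def assign_virtual_sites(dataset, n=3, rng=random):
        """On dataset registration: draw n distinct sites, each with probability
        proportional to its share of the cloud's CPUs. No data is copied; the
        resulting list would simply be stored for this dataset."""
        remaining, chosen = dict(CLOUD_SITES), []
        for _ in range(n):
            r, acc = rng.uniform(0, sum(remaining.values())), 0.0
            for site, cpus in remaining.items():
                acc += cpus
                if r <= acc:
                    chosen.append(site)
                    del remaining[site]          # sample without replacement
                    break
        return chosen

    def broker(vp_sites, sites_in_outage):
        """PanDA-side rule: send the job to the first assigned site that is up;
        the job then reads its input through that site's cache."""
        for site in vp_sites:
            if site not in sites_in_outage:
                return site
        return None                              # fall back to regular brokering

    # Example:
    # vp = assign_virtual_sites("data18_13TeV.DAOD_PHYS.example")   # e.g. ["SiteA", "SiteC", "SiteB"]
    # broker(vp, sites_in_outage={"SiteA"})                         # -> "SiteC"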

VP - expectations

17

VP to two sites of the same cloud.

One Data Lake (has all the data).

Each cloud has an XCache (100 TB / 2k cores).

Each site has an XCache (100 TB / 1k cores).

Resources would be fully used. TTC (time-to-completion) would be the same as now. Caches would deliver 80% of the data. Throughput at the caches would be reasonable.

2nd-level XCache

Used only when overflowing to the second/third-choice site.

  • Could be in-network.
  • Large.
  • High throughput.
  • Rather low cache hit rate.

18
