Using XCaches
Ilija Vukotic, University of Chicago
XRootD Workshop
11-12 June 2019
CC-IN2P3 Lyon
Current ATLAS distributed data management
2
* Exceptions are two sites that are actually composed of several nearby sites and only one of them has storage.
** The only exceptions are special Hospital queues that do <0.5% of jobs.
ATLAS use cases
3
Future
The working dataset will grow in size, unlike “hot” disk capacity.
Data Lake - fewer bigger storages.
Higher usage of colder storages.
Reduce manpower needed while maintaining or increasing utilization.
New use cases on the horizon: low latency, high throughput, subfile, intelligent data delivery services (ESS/ServiceX/iDDs/...).
Whether and how caches will be used is still a matter of debate, but they are deemed an essential part of future DDM.
4
Keys to successfully using (X)Cache*
5
* Assuming stable and performant operation of cache itself
Cache aware scheduling
Cache efficiency in the current setup is limited by the size of the working data set and by the fact that almost all sites accept all task types.
A further complication is reprocessing campaigns, which would flush caches with data not to be used again for months.
Several possible approaches:
6
Central management
Previous experience has shown that our software changes far too quickly for large, locally managed distributed deployments.
The solution is to have a small group of service experts that fully manage the deployments.
With the rise of Kubernetes, and platforms like SLATE, that is easily done.
Initial setup of the k8s and SLATE platform is quite straightforward, and much easier than setting up, configuring, and managing even just one service (e.g. XCache). With more services deployed in this way, the cost of k8s and SLATE is further amortized.
While ATLAS will have some locally managed caches, most of them will be centrally managed.
7
Current XCaches - centrally managed
Deployed using SLATE at AGLT2 and MWT2.
Using SLATE-recommended hardware (HDD).
SWT2 & NET2 still waiting for hardware.
Storage in JBOD configuration. No pre-caching or prefetching. LRU clean-up model.
Used for special “Hospital” queues. Low stress level (up to 250 cores). All accesses are done by the Pilot, which performs a simple xrdcp and checks adler32 checksums. Job scheduling is completely cache-unaware. It takes weeks to fill the cache and months for the XCache content to reach a steady state.
Limited set of origins. Configured in AGIS (currently BNL, TRIUMF, Victoria, AGLT2, MWT2).
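The setup described above (JBOD disks, no prefetching, LRU-style purging, a fixed origin) maps onto a few XRootD proxy-file-cache directives. A minimal sketch is below; hostnames and paths are illustrative, not the actual ATLAS deployment values.

```
# Illustrative XCache (proxy file cache) fragment -- not the production config
ofs.osslib   libXrdPss.so
pss.cachelib libXrdFileCache.so
pss.origin   someorigin.example.org:1094

# JBOD: each disk registered as a separate data space
oss.localroot /xcache-meta
oss.space data /disk1
oss.space data /disk2

# no prefetching; purge (LRU-like) between low/high disk-usage watermarks
pfc.prefetch 0
pfc.diskusage 0.90 0.95
```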
8
Performance
Stable. Discounting SWT2, bad transfer checksums occur at the per-mille level.
Since scheduling is cache-unaware, the cache hit rate is almost negligible.
Unexpectedly, the load is quite spiky.
With 2 WQ threads and 10 WQ blocks per thread, cached-file sparseness is at the 1% level. This could be very different in realistic cache operation, where we expect roughly twice as many reads as writes to disk.
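The adler32 validation that flags the bad transfers above can be sketched in a few lines; the helper below is illustrative, not actual Pilot code.

```python
import zlib

def adler32_hex(path, chunk_size=1024 * 1024):
    """Compute the adler32 checksum of a file, as the Pilot does after an
    xrdcp transfer (hypothetical helper, not the Pilot implementation)."""
    value = 1  # adler32 seed value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xffffffff:08x}"
```

The resulting hex digest would be compared against the catalog value; a mismatch marks the transfer as bad.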
9
Current XCaches - centrally managed
ESNet node in Sunnyvale.
24TB SSDs on a high performance node with 100Gbps NIC.
Only stress tested. Sustained 17 Gbps ingress from ATLAS US sites and 20 Gbps egress to Google cloud clients. It could probably do more; one limitation is that GCE nodes are limited to 1440-byte frames.
10
Current XCaches - privately managed
Tier2 @ Birmingham
Analysis centers (BNL, SLAC)
Privately managed instances have no automatic AGIS (de)activation and are not included in central monitoring.
11
Monitoring XCache usefulness
We are interested in:
While we collect all of this data, XCache usefulness can only be determined by having the WFMS schedule jobs in an XCache-aware way and then performing an A/B test.
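The A/B comparison could be as simple as contrasting walltimes of jobs brokered cache-aware (group A) against cache-unaware jobs (group B). A minimal sketch, with a hypothetical function name and made-up example numbers:

```python
from statistics import mean

def ab_cache_benefit(walltimes_a, walltimes_b):
    """Hypothetical A/B comparison: group A jobs were brokered
    XCache-aware, group B cache-unaware. Returns the relative
    reduction in mean walltime attributable to the cache."""
    ma, mb = mean(walltimes_a), mean(walltimes_b)
    return (mb - ma) / mb

# e.g. cache-aware jobs averaging 90 min vs 100 min without the cache:
# ab_cache_benefit([90] * 10, [100] * 10) -> 0.1 (10% walltime reduction)
```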
12
Near future
Cache aware brokering (VP).
Adding more sites (SWT2 & NET2).
Restart automatic validation of origin servers.
Gradually increase percentage of jobs using xcache.
A/B testing.
Long term:
Rucio integrated VP.
Multi-node xcache sites.
13
RESERVE slides
14
A way to do it
15
VP - Virtual Placement
16
This way we get:
VP - expectations
17
VP to two sites of same cloud
One Data Lake (has all the data)
Each cloud has XCache (100TB/2k cores)
Each site has XCache (100TB/1k cores)
Resources would be fully used. TTC would be the same as now. Cache would deliver 80% of data. Throughput at caches would be reasonable.
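The cache sizing assumed in the simulation above works out to the following capacity per core (illustrative arithmetic only):

```python
TB = 10**12

# 100 TB of XCache per 2k cores at the cloud level,
# 100 TB of XCache per 1k cores at the site level
cloud_cache_per_core = 100 * TB / 2000  # 50 GB per core
site_cache_per_core = 100 * TB / 1000   # 100 GB per core
```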
2nd level xcache
Used only when overflowing to the second- or third-choice site.
18