XCache at SC18
Ilija, Lincoln, Shawn
Idea
Demo a high-throughput XCache service on the SLATE platform at two nodes.
Cloud-based clients read from the ATLAS data “lake” through the cache.
The payload data is a replay of MWT2 production accesses from August.
Steps
Steps 1,2,3 - stress data
The source of the data accessed at MWT2 is Rucio traces. These are collected at CERN and re-indexed in Elasticsearch (ES) at MWT2.
Python code looks the traces up in ES, selecting only data that was used as input to our production queues. The Rucio API is then used to find replicas of the files at US sites that are accessible via xrootd. Full paths to the selected replicas are stored in a new ES index (stress).
One can tune the ratio of files coming from different sites by indexing specifically from a given site or by removing already-indexed documents.
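The pipeline above (ES trace lookup, Rucio replica resolution, then a new stress index) can be sketched as follows. All index, field, and endpoint names here are illustrative assumptions, and the Rucio replica lookup is only indicated in a comment:

```python
"""Sketch of the trace-harvesting step. Index and field names are
assumptions for illustration; the real ones live in the MWT2 ES instance."""
import json
import urllib.request

ES_URL = "http://localhost:9200"   # assumed ES endpoint (behind the MWT2 firewall)
STRESS_INDEX = "stress"            # the new index of replica paths

def traces_query(days=31):
    """ES query selecting only traces of files read as production input."""
    return {
        "query": {"bool": {"must": [
            {"term": {"eventType": "get"}},                      # assumed field
            {"range": {"timestamp": {"gte": f"now-{days}d"}}},   # assumed field
        ]}},
        "_source": ["scope", "filename"],
        "size": 1000,
    }

def stress_doc(site, root_path):
    """One document per selected replica: the full xrootd path at a US site.
    The paths themselves would come from the Rucio replica-listing API."""
    return {"site": site, "path": root_path}

def index_doc(doc):
    """Store a document in the stress index through the ES REST API."""
    req = urllib.request.Request(
        f"{ES_URL}/{STRESS_INDEX}/_doc",
        data=json.dumps(doc).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

if __name__ == "__main__":
    print(json.dumps(traces_query(), indent=2))
```

Tuning the per-site mix then amounts to running the indexing with a site filter, or deleting documents for an over-represented site.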
Step 4 - REST interface to stress data
As the stress tests can run anywhere, it is not convenient to open the ES firewall for stress-test access to the data. We therefore created a Node.js server with a REST interface that delivers paths of the files to be accessed, receives the test results, and updates ES.
The server runs on the UChicago K8s cluster and serves at: xcache.mwt2.org
Endpoints are:
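The endpoint list itself is not reproduced here. As an illustration of the interface shape only (the real service is Node.js, and these endpoint names are invented for the sketch), a minimal Python stand-in:

```python
"""Minimal stand-in for the REST service: hand out a path, accept a result.
Endpoint names and the in-memory data are illustrative assumptions."""
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

PATHS = ["/atlas/rucio/mc16_13TeV/AOD.example.pool.root.1"]  # stand-in for ES
RESULTS = []                                                 # stand-in for ES

class StressHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /path -> next file a client should fetch (name assumed)
        if self.path == "/path":
            body = json.dumps({"path": PATHS[0]}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def do_POST(self):
        # e.g. POST /result -> store a test outcome (name assumed);
        # the real service forwards this document to ES
        if self.path == "/result":
            length = int(self.headers["Content-Length"])
            RESULTS.append(json.loads(self.rfile.read(length)))
            self.send_response(201)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

# To run standalone: HTTPServer(("", 8080), StressHandler).serve_forever()
```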
Step 5 - spin up clients in clouds
We used clients in 3 different clouds - Google Compute Engine (GCE), Microsoft Azure, and Amazon AWS. For SC18 we used only GCE and Azure, as these are the fastest to spin up and scale down.
Clients request a stress-test path, prepend the XCache address to it, and xrdcp it to /dev/null. The measured transfer time and return code are sent back to ES through the XCache service. The stress-test client image is the same as the XCache server one, and the deployment can be found in the same GitHub repository.
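The client loop can be sketched as follows; the service endpoints and the cache address are assumptions, and only the URL-composition helper is exercised without a network or an xrdcp binary:

```python
"""Sketch of the stress-test client loop. SERVICE, XCACHE, and the
endpoint names are assumptions; the steps mirror the slide text."""
import json
import subprocess
import time
import urllib.request

SERVICE = "http://xcache.mwt2.org"         # REST service from the previous step
XCACHE = "root://xcache.example.org:1094"  # assumed cache address

def cache_url(cache, path):
    """Prepend the cache address to a stress-test path (xrootd double slash)."""
    return cache.rstrip("/") + "//" + path.lstrip("/")

def run_one():
    # 1. ask the service for a stress-test path (endpoint name assumed)
    with urllib.request.urlopen(f"{SERVICE}/path") as resp:
        path = json.loads(resp.read())["path"]
    # 2. copy the file through the cache to /dev/null, timing the transfer
    start = time.time()
    rc = subprocess.call(["xrdcp", "-f", cache_url(XCACHE, path), "/dev/null"])
    elapsed = time.time() - start
    # 3. report transfer time and return code back through the service
    body = json.dumps({"path": path, "time": elapsed, "rc": rc}).encode()
    req = urllib.request.Request(f"{SERVICE}/result", data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
    return rc

if __name__ == "__main__":
    print(cache_url(XCACHE, "/atlas/rucio/somefile"))
```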
Step 5 - GCE
Step 5 - Azure
Step 8 - Monitoring
In addition to cloud-based client monitoring, we have two other ways to monitor XCache stress performance:
XCache server reported
Step 8
Monitoring cont’d
XCache clients reported
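Since the client reports end up in ES, one way to watch stress performance is an ES aggregation over the results. A sketch, with the endpoint, index, and field names assumed:

```python
"""Sketch of querying client-reported stress results from ES.
ES_URL, INDEX, and the field names are assumptions."""
import json
import urllib.request

ES_URL = "http://localhost:9200"   # assumed ES endpoint
INDEX = "stress-results"           # assumed index of client reports

def rate_query(minutes=10):
    """Mean transfer time and failure count over a recent window."""
    return {
        "query": {"range": {"timestamp": {"gte": f"now-{minutes}m"}}},
        "aggs": {
            "mean_time": {"avg": {"field": "time"}},
            "failures": {"filter": {"range": {"rc": {"gt": 0}}}},
        },
        "size": 0,
    }

def search(query):
    """POST the query to the ES _search endpoint and return the parsed reply."""
    req = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_search", data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    print(json.dumps(rate_query(), indent=2))
```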
Storage and deployment
Second-level Caching mechanisms
ZFS + RBD
ZFS Performance
L2ARC max write speed capped at 64 MB/s per device.
L2ARC max write increased to 750 MB/s per device.
Thrashing?
Network
Results & conclusions