1 of 16

XCache at SC18

Ilija, Lincoln, Shawn

2 of 16

Idea

Demo a high-throughput XCache service on the SLATE platform on two nodes.

Cloud-based clients read from the ATLAS data “lake” through the cache.

The payload data are a replay of MWT2 production accesses from August.

3 of 16

Steps

  • List all the files that were accessed by ATLAS production jobs at MWT2 in August.
  • Use Rucio to get remote access paths to all of them (preferably from US sites)
  • Store all this data in Elasticsearch at UChicago
  • Create and run a REST interface at UChicago. On request it delivers stress test paths and collects data on success/failure and rate.
  • Spin up a k8s cluster in Google and Azure clouds.
  • Create a deployment with n pods that will receive stress test paths from UC service, and xrdcp all these files through the cache boxes at SC to /dev/null.
  • We try to run it until one of these happens:
    • Xcache server code breaks
    • Remote connection to some ATLAS site gets saturated
    • Disks in SC cache nodes break/saturate
    • NICs in SC cache nodes saturate
    • Something unexpected breaks 😊
  • We collect data on the amount of data transferred, rates observed (to clients, to cache servers), success/failure ratio, cache hit rate, …

4 of 16

Steps 1,2,3 - stress data

The source of the data accessed at MWT2 is Rucio traces. These are collected at CERN and reindexed in ES at MWT2.

Python code looks up the traces in ES, selecting only data that was used as input to our production queues. The Rucio API was then used to find replicas of the files at US sites that are accessible via XRootD. Full paths to the selected replicas are stored in a new ES index (stress).

One can tune the ratio of files coming from different sites by indexing specifically from a given site or by removing already-indexed documents.
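
A minimal sketch of this selection step is shown below. It assumes hypothetical index names (rucio_traces, stress), trace field names, and an RSE expression for US sites; the actual code may differ.

  # Sketch: select MWT2 production input files from the trace index, resolve
  # root:// replicas at US sites with the Rucio API, and store the paths in
  # the "stress" index. Index names, field names and the RSE expression are
  # assumptions; chunking of the DID list is omitted for brevity.
  from elasticsearch import Elasticsearch, helpers
  from rucio.client import Client

  es = Elasticsearch(['http://localhost:9200'])   # assumed ES endpoint
  rucio = Client()

  # 1. Traces of files read by production jobs at MWT2 in August.
  query = {'query': {'bool': {'must': [
      {'term': {'site': 'MWT2'}},
      {'term': {'eventType': 'production'}},      # placeholder selector
      {'range': {'timestamp': {'gte': '2018-08-01', 'lt': '2018-09-01'}}},
  ]}}}
  dids = {(t['_source']['scope'], t['_source']['filename'])
          for t in helpers.scan(es, index='rucio_traces', query=query)}

  # 2. Resolve xrootd-accessible replicas at US sites.
  actions = []
  for rep in rucio.list_replicas(
          dids=[{'scope': s, 'name': n} for s, n in dids],
          schemes=['root'], rse_expression='country=US'):
      for rse, pfns in rep['rses'].items():
          for pfn in pfns:
              actions.append({'_index': 'stress',
                              '_source': {'site': rse, 'path': pfn,
                                          'size': rep['bytes']}})

  # 3. Store the selected replica paths in the new index.
  helpers.bulk(es, actions)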

5 of 16

Step 4 - REST interface to stress data

As stress tests can run anywhere, it is not convenient to open the ES firewall for stress-test access to the data. So we created a Node.js server with a REST interface that delivers the paths of the files to be accessed, receives the test results, and updates ES.

The server runs in the UChicago K8s cluster and serves at: xcache.mwt2.org

Endpoints are:

  • GET /stress_test
  • GET /stress_result/$id/$return_code/$duration
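
As an illustration, the endpoints could be exercised like this (the JSON field names id and path in the response, and plain HTTP, are assumptions):

  # Sketch of the two endpoints from a client's point of view.
  import requests

  BASE = 'http://xcache.mwt2.org'

  # Ask for one file to test; assume a response such as
  # {"id": "...", "path": "root://origin.example.org//atlas/rucio/..."}
  test = requests.get(BASE + '/stress_test').json()

  # ...run the transfer, then report id, return code and duration (seconds):
  requests.get('{}/stress_result/{}/{}/{}'.format(BASE, test['id'], 0, 12.3))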

6 of 16

Step 5 - spin up clients in clouds

We used clients in 3 different clouds: Google Compute Engine (GCE), Microsoft Azure, and Amazon AWS. For SC18 we used only GCE and Azure, as these are the fastest to spin up and scale down.

Clients request a stress test path, prepend the XCache address to it, and xrdcp it to /dev/null. The measured transfer time and return code are sent back to ES through the XCache service. The stress test client’s image is the same as the XCache server one, and the deployment can be found in the same GitHub repository.
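
A minimal sketch of such a client loop, assuming the response fields from the previous slide, an XCache endpoint of the form root://<host>:1094//, and xrdcp available in the image:

  # Sketch of the stress-test client: fetch a path, copy it through the cache
  # to /dev/null, and report return code and duration back to the service.
  # The cache endpoint and URL composition are assumptions.
  import subprocess
  import time

  import requests

  SERVICE = 'http://xcache.mwt2.org'
  XCACHE = 'root://xcache.example.org:1094//'     # placeholder cache address

  while True:
      test = requests.get(SERVICE + '/stress_test').json()
      url = XCACHE + test['path']                 # prepend the cache address

      start = time.monotonic()
      rc = subprocess.run(['xrdcp', '-f', url, '/dev/null']).returncode
      duration = time.monotonic() - start

      requests.get('{}/stress_result/{}/{}/{:.1f}'.format(
          SERVICE, test['id'], rc, duration))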

7 of 16

Step 5- GCE

  • K8s 1.10
  • g1-small (1 vCPU, 1.7 GB memory)
  • us-west1-a
  • Monitoring using Stackdriver


8 of 16

Step 5 - Azure

  • K8s 1.11.3
  • West US
  • Standard A1 (1 vCPU, 1.75 GB memory)
  • Monitoring using Insights

9 of 16

Step 8 - Monitoring

In addition to cloud-based client monitoring, we have two other ways to monitor XCache stress performance:

  • By looking at the client-reported numbers
  • The XCache deployment comes with a container that periodically reports data from the XCache metadata (.cinfo) files (a rough sketch follows below)
  • Both of these are stored in ES at UC.
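
A rough sketch of what such a reporting container could do is shown below. It assumes the cache is mounted at /data/xcache and a hypothetical xcache_reports index; it only aggregates filesystem-level numbers and does not parse the binary .cinfo format itself.

  # Sketch of a periodic reporter: walk the cache namespace, count files that
  # have .cinfo metadata, sum the size of the cached data, and index one
  # summary document into ES. Paths, index name and interval are assumptions.
  import os
  import socket
  import time
  from datetime import datetime, timezone

  from elasticsearch import Elasticsearch

  CACHE_ROOT = '/data/xcache'                     # assumed cache mount point
  es = Elasticsearch(['http://localhost:9200'])   # assumed ES endpoint

  while True:
      n_files, cached_bytes = 0, 0
      for dirpath, _, filenames in os.walk(CACHE_ROOT):
          for name in filenames:
              if name.endswith('.cinfo'):
                  n_files += 1
                  data_file = os.path.join(dirpath, name[:-len('.cinfo')])
                  if os.path.exists(data_file):
                      cached_bytes += os.path.getsize(data_file)

      es.index(index='xcache_reports', body={
          'host': socket.gethostname(),
          'timestamp': datetime.now(timezone.utc).isoformat(),
          'cached_files': n_files,
          'cached_bytes': cached_bytes,
      })
      time.sleep(300)                             # report every 5 minutes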

XCache server-reported numbers

10 of 16

Step 8 - Monitoring (cont.)

XCache client-reported numbers

11 of 16

Storage and deployment

  • 50TB RADOS Block Device provisioned from OSiRIS storage on the floor at SC
  • Storage configured in a Ceph cache tier.
  • RBD mounted to 100Gbps DTN
  • Kubernetes deployed on both DTN and Ceph storage node
  • Cluster registered with, and software deployed through, SLATE:
    • slate app install --dev --vo atlas-xcache --cluster um-sc18 --conf xcache.yaml xcache

12 of 16

Second-level Caching mechanisms

  • 100Gbps DTN node (nvm01) has four NVMes in it. Wanted to take advantage of them!
    • Samsung 960 Pro 1TB
    • Specs claim up to 3.5GB/s reads per NVMe
  • Tried RAIDing the NVMes together with both ZFS and mdraid
    • Performance was similar; only able to read ~3.5GB/s even in RAID 0
    • Very small cache without the RBD - only 4TB. Cache hits unlikely
  • Tried two different methods of attaching the cache to the backing store
    • dm-cache, via LVM
    • ZFS w/ L2ARC + Intent Log
  • ZFS performed better than dm-cache

13 of 16

ZFS + RBD

  • Created ZFS filesystem on top of RADOS Block Device to take advantage of built-in SSD caching features
  • 3 SSDs attached as “L2ARC” (Level 2 Adaptive Replacement Cache)
    • improves reads
  • 1 SSD attached as ZIL (ZFS Intent Log)
    • improves writes
  • As the RBD is a network block device, ZFS needed various tunings:
    • Pool needed alignment shift (4k sectors), record size (1MB), no atime / directory atime
    • In the kernel driver
      • we increased the number of parallel IOs to the block device by 5-10x to take advantage of Ceph’s concurrency
      • L2ARC max write speed was increased from 8MB/s to 512MB/s to fill the cache faster (see the sketch below)
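
For illustration, tunings of this kind can be applied by writing OpenZFS module parameters under /sys. The parameter selection and values below are assumptions meant to show the mechanism, not the exact settings used at SC18.

  # Sketch: raise ZFS read concurrency against the RBD and the L2ARC fill
  # rate. Run as root with the zfs module loaded; values are illustrative.
  PARAMS = {
      'zfs_vdev_async_read_max_active': 30,   # ~10x the default of 3
      'zfs_vdev_sync_read_max_active': 50,    # ~5x the default of 10
      'l2arc_write_max': 512 * 1024 * 1024,   # 8MB/s -> 512MB/s
  }

  for name, value in PARAMS.items():
      with open('/sys/module/zfs/parameters/' + name, 'w') as f:
          f.write(str(value))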

14 of 16

ZFS Performance

L2ARC max write speed capped at 64MB/s per device

L2ARC max write increased to 750MB/s per device.

Thrashing?

15 of 16

Network

  • Static routes were created for MWT2 and AGLT2 storage, such that traffic flowed through the “milr” interface back to AGLT2
  • Traffic to Cloud providers went through the “scinet” interface
  • SWT2 and SLAC were added, but static routes were not set up - that traffic flowed through the scinet interface

16 of 16

Results & conclusions

  • Stress testing from the cloud is very convenient (total cost: Azure $57 (everyone gets $200 free credit), GCE $96)
  • XCache server
    • Stable
    • Reached 2GB/s when stressed with 2×30 clients
    • Could not write all the data that came through it to disk. Discussing this with Matevz & Alja
    • We will have to find the best configuration for the server.
  • Still a long way to go to fill up the 100Gbps NIC