1 of 16

XCache at SC18

Ilija, Lincoln, Shawn

2 of 16

Idea

Demo a high-throughput XCache service on the SLATE platform on two nodes.

Cloud-based clients read from the ATLAS data “lake” through the cache.

The payload data are a replay of MWT2 production accesses from August.

3 of 16

Steps

  • List all the files that were accessed by ATLAS production jobs at MWT2 in August.
  • Use Rucio to get remote access paths to all of them (preferably from US sites)
  • Store all this data in Elasticsearch at UChicago
  • Create and run a REST interface at UChicago. On request it delivers stress test paths and collects data on success/failure and rate.
  • Spin up a k8s cluster in Google and Azure clouds.
  • Create a deployment with n pods that will receive stress test paths from UC service, and xrdcp all these files through the cache boxes at SC to /dev/null.
  • We try to run it until one of these happens:
    • Xcache server code breaks
    • Remote connection to some ATLAS site gets saturated
    • Disks in SC cache nodes break/saturate
    • NICs in SC cache nodes saturate
    • Something unexpected breaks 😊
  • We collect data on the amount of data transferred, rates observed (to clients, to cache servers), success/failure ratio, cache hit rate, …

4 of 16

Steps 1,2,3 - stress data

The source of the data accessed at MWT2 is Rucio traces. These are collected at CERN and reindexed in ES at MWT2.

Python code looks up the traces in ES, selecting only data that was used as input to our production queues. The Rucio API was then used to find replicas of the files at US sites that are accessible via XRootD. Full paths to the selected replicas are stored in a new ES index (stress).

One can tune the ratio of files coming from different sites by indexing specifically from a given site or by removing already-indexed documents.
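
A minimal sketch of this selection step is shown below. It assumes hypothetical index names (rucio_traces, stress), trace field names, and an RSE expression for US sites; the actual code may differ.

  # Sketch: select MWT2 production input files from the trace index, resolve
  # root:// replicas at US sites with the Rucio API, and store the paths in
  # the "stress" index. Index names, field names and the RSE expression are
  # assumptions; chunking of the DID list is omitted for brevity.
  from elasticsearch import Elasticsearch, helpers
  from rucio.client import Client

  es = Elasticsearch(['http://localhost:9200'])   # assumed ES endpoint
  rucio = Client()

  # 1. Traces of files read by production jobs at MWT2 in August.
  query = {'query': {'bool': {'must': [
      {'term': {'site': 'MWT2'}},
      {'term': {'eventType': 'production'}},      # placeholder selector
      {'range': {'timestamp': {'gte': '2018-08-01', 'lt': '2018-09-01'}}},
  ]}}}
  dids = {(t['_source']['scope'], t['_source']['filename'])
          for t in helpers.scan(es, index='rucio_traces', query=query)}

  # 2. Resolve xrootd-accessible replicas at US sites.
  actions = []
  for rep in rucio.list_replicas(
          dids=[{'scope': s, 'name': n} for s, n in dids],
          schemes=['root'], rse_expression='country=US'):
      for rse, pfns in rep['rses'].items():
          for pfn in pfns:
              actions.append({'_index': 'stress',
                              '_source': {'site': rse, 'path': pfn,
                                          'size': rep['bytes']}})

  # 3. Store the selected replica paths in the new index.
  helpers.bulk(es, actions)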

5 of 16

Step 4 - REST interface to stress data

As stress tests can run anywhere, it is not convenient to open the ES firewall for stress-test access to the data. So we created a Node.js server with a REST interface that delivers the paths of the files to be accessed, receives the test results, and updates ES.

The server runs in the UChicago K8s cluster and serves at: xcache.mwt2.org

Endpoints are:

  • GET /stress_test
  • GET /stress_result/$id/$return_code/$duration
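
As an illustration, the endpoints could be exercised like this (the JSON field names id and path in the response, and plain HTTP, are assumptions):

  # Sketch of the two endpoints from a client's point of view.
  import requests

  BASE = 'http://xcache.mwt2.org'

  # Ask for one file to test; assume a response such as
  # {"id": "...", "path": "root://origin.example.org//atlas/rucio/..."}
  test = requests.get(BASE + '/stress_test').json()

  # ...run the transfer, then report id, return code and duration (seconds):
  requests.get('{}/stress_result/{}/{}/{}'.format(BASE, test['id'], 0, 12.3))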

6 of 16

Step 5 - spin up clients in clouds

We used clients in 3 different clouds: Google Compute Engine (GCE), Microsoft Azure, and Amazon AWS. For SC18 we used only GCE and Azure, as these are the fastest to spin up and scale down.

Clients request a stress test path, prepend the XCache address to it, and xrdcp it to /dev/null. The measured transfer time and return code are sent back to ES through the XCache service. The stress test client’s image is the same as the XCache server one, and the deployment can be found in the same GitHub repository.
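
A minimal sketch of such a client loop, assuming the response fields from the previous slide, an XCache endpoint of the form root://<host>:1094//, and xrdcp available in the image:

  # Sketch of the stress-test client: fetch a path, copy it through the cache
  # to /dev/null, and report return code and duration back to the service.
  # The cache endpoint and URL composition are assumptions.
  import subprocess
  import time

  import requests

  SERVICE = 'http://xcache.mwt2.org'
  XCACHE = 'root://xcache.example.org:1094//'     # placeholder cache address

  while True:
      test = requests.get(SERVICE + '/stress_test').json()
      url = XCACHE + test['path']                 # prepend the cache address

      start = time.monotonic()
      rc = subprocess.run(['xrdcp', '-f', url, '/dev/null']).returncode
      duration = time.monotonic() - start

      requests.get('{}/stress_result/{}/{}/{:.1f}'.format(
          SERVICE, test['id'], rc, duration))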

7 of 16

Step 5- GCE

  • K8s 1.10
  • g1-small (1 vCPU, 1.7 GB memory)
  • us-west1-a
  • Monitoring using Stackdriver


8 of 16

Step 5 - Azure

  • K8s 1.11.3
  • West US
  • Standard A1 (1 vCPU, 1.75 GB memory)
  • Monitoring using Insights

9 of 16

Step 8 - Monitoring

In addition to cloud-based client monitoring, we have two other ways to monitor XCache stress performance:

  • By looking at the client-reported numbers
  • The XCache deployment comes with a container that periodically reports data from the XCache metadata (.cinfo) files (a rough sketch follows below)
  • Both of these are stored in ES at UC.
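
A rough sketch of what such a reporting container could do is shown below. It assumes the cache is mounted at /data/xcache and a hypothetical xcache_reports index; it only aggregates filesystem-level numbers and does not parse the binary .cinfo format itself.

  # Sketch of a periodic reporter: walk the cache namespace, count files that
  # have .cinfo metadata, sum the size of the cached data, and index one
  # summary document into ES. Paths, index name and interval are assumptions.
  import os
  import socket
  import time
  from datetime import datetime, timezone

  from elasticsearch import Elasticsearch

  CACHE_ROOT = '/data/xcache'                     # assumed cache mount point
  es = Elasticsearch(['http://localhost:9200'])   # assumed ES endpoint

  while True:
      n_files, cached_bytes = 0, 0
      for dirpath, _, filenames in os.walk(CACHE_ROOT):
          for name in filenames:
              if name.endswith('.cinfo'):
                  n_files += 1
                  data_file = os.path.join(dirpath, name[:-len('.cinfo')])
                  if os.path.exists(data_file):
                      cached_bytes += os.path.getsize(data_file)

      es.index(index='xcache_reports', body={
          'host': socket.gethostname(),
          'timestamp': datetime.now(timezone.utc).isoformat(),
          'cached_files': n_files,
          'cached_bytes': cached_bytes,
      })
      time.sleep(300)                             # report every 5 minutes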

XCache server-reported numbers

10 of 16

Step 8 - Monitoring (cont.)

XCache client-reported numbers

11 of 16

Storage and deployment

  • 50TB RADOS Block Device provisioned from OSiRIS storage on the floor at SC
  • Storage configured in a Ceph cache tier.
  • RBD mounted to 100Gbps DTN
  • Kubernetes deployed on both DTN and Ceph storage node
  • Cluster registered with, and software deployed through, SLATE:
    • slate app install --dev --vo atlas-xcache --cluster um-sc18 --conf xcache.yaml xcache

12 of 16

Second-level Caching mechanisms

  • 100Gbps DTN node (nvm01) has four NVMes in it. Wanted to take advantage of them!
    • Samsung 960 Pro 1TB
    • Specs claim up to 3.5GB/s reads per NVMe
  • Tried RAIDing the NVMes together with both ZFS and mdraid
    • Performance was similar; only able to read ~3.5GB/s even in RAID 0
    • Very small cache without the RBD - only 4TB. Cache hits unlikely
  • Tried two different methods of attaching the cache to the backing store
    • dm-cache, via LVM
    • ZFS w/ L2ARC + Intent Log
  • ZFS performed better than dm-cache

13 of 16

ZFS + RBD

  • Created ZFS filesystem on top of RADOS Block Device to take advantage of built-in SSD caching features
  • 3 SSDs attached as “L2ARC” (Level 2 Adaptive Replacement Cache)
    • improves reads
  • 1 SSD attached as ZIL (ZFS Intent Log)
    • improves writes
  • As the RBD is a network block device, ZFS needed various tunings:
    • Pool needed alignment shift (4k sectors), record size (1MB), no atime / directory atime
    • In the kernel driver
      • we increased the number of parallel IOs to the block device by 5-10x to take advantage of Ceph’s concurrency
      • L2ARC max write speed was increased from 8MB/s to 512MB/s to fill the cache faster (see the sketch below)
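
For illustration, tunings of this kind can be applied by writing OpenZFS module parameters under /sys. The parameter selection and values below are assumptions meant to show the mechanism, not the exact settings used at SC18.

  # Sketch: raise ZFS read concurrency against the RBD and the L2ARC fill
  # rate. Run as root with the zfs module loaded; values are illustrative.
  PARAMS = {
      'zfs_vdev_async_read_max_active': 30,   # ~10x the default of 3
      'zfs_vdev_sync_read_max_active': 50,    # ~5x the default of 10
      'l2arc_write_max': 512 * 1024 * 1024,   # 8MB/s -> 512MB/s
  }

  for name, value in PARAMS.items():
      with open('/sys/module/zfs/parameters/' + name, 'w') as f:
          f.write(str(value))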

14 of 16

ZFS Performance

L2ARC max write speed capped at 64MB/s per device

L2ARC max write increased to 750MB/s per device.

Thrashing?

15 of 16

Network

  • Static routes were created for MWT2 and AGLT2 storage, such that traffic flowed through the “milr” interface back to AGLT2
  • Traffic to Cloud providers went through the “scinet” interface
  • SWT2 and SLAC were added, but static routes were not set up - that traffic flowed through the scinet interface

16 of 16

Results & conclusions

  • Stress testing from the cloud is very convenient (total cost: Azure $57 (everyone gets $200 free credit), GCE $96)
  • XCache server
    • Stable
    • Reached 2GB/s when stressed with 2×30 clients
    • Could not write all the data that came through it to disk. Discussing this with Matevz & Alja
    • We will have to find the best configuration for the server.
  • Still a long way to go to fill up the 100Gbps NIC