1 of 34

XCache: Deployment, Integration, Performance

Ilija Vukotic

December 4, 2018

WBS 2.3.5.4 Intelligent Data Delivery

2 of 34

Current situation

  • Several different groups are involved in xrootd-based caching
    • ECDF (WN cache, diskless-site simulation) (link)
    • SLAC (Wei)
    • Italian cloud (simulation, planning to convert all sites to diskless) (presentation this morning)
    • German cloud (simulation, understanding the working data set)
    • MWT2, AGLT2 (simulation, deployed caches in hospital queues)
    • French cloud (should start simulation soon) (link)
    • BNL (as we learned yesterday)
  • Non-xrootd-based caching
    • David Cameron - ARC (in production) (link)
    • Irvin Umaña Chacon, David Smith - Prague (simulation) (link)
    • DPM/Dynafed (link1, link2)


3 of 34

Current understanding of data reuse

Production inputs are slightly more cacheable (52% of accesses and 67% of data volume) than analysis inputs (35% of accesses and 37% of data volume).

Different file types have very different access patterns (e.g. HITS, EVNT, and payload files are very cacheable; DAODs, panda*, and AODs less so).

As expected, it rarely happens that the same file is accessed at two different sites.

Claim: even a cache of 100 TB per site would be sufficient to deliver roughly half of the accesses and data volume.


4 of 34

Current understanding of US scale cache

A multilayer cache would not significantly help.

The throughput generated between sites would be reasonable.


5 of 34

Can XCache deliver what is needed?

Can SLATE deliver what is needed?

  • Stability
  • Correctness
  • Throughput (from reasonable hardware)

  • Servicing the application without contacting a sysadmin
    • Deploy, Update, Remove
    • Monitoring, Inspection, Debugging


Try it and find out!

6 of 34

XCache TODO list from November 7th

  • SC2018 demo (findings)
    • Running an XCache server (100 Gbps NIC, NVMe storage)
    • Clients running in Azure, Google, and AWS
    • Data coming from US sites
    • Pre-tested last week to ~20 Gbps
  • Stress-test XCache nodes and determine the optimal configuration
  • Get all US sites to deploy a SLATE node (MWT2 and AGLT2 done)
  • Get additional data reported in Rucio traces (pandaID)
  • Simulate XCache with separation of data formats (at US scale)
  • Get “hospital” queues configured and working
  • AGIS auto-(de)activation.


(Four of these items were marked DONE on the slide.)

7 of 34

Hospital queues

Queues that get jobs not accepted anywhere else. They are configured to use remote SEs for input, but all inputs go through the local XCache. Output is handled the same way as for regular jobs.

Set up at MWT2 (production and analysis) and AGLT2 (production). Thanks Judith and Wenjing!

We “mounted” them against the US DATADISKs. Not without issues:

  • The BNL endpoint would randomly stop authenticating accesses.
  • The NET2 endpoint stopped working, and it happened to be the only instance. REMINDER: all sites are obliged to provide a working xrootd endpoint.


8 of 34

XCaches for Hospital queues

MWT2

  • PowerEdge R740XD
  • 8 x 8 TB 7.2k RPM NL-SAS
  • 1 x Dell 1.6 TB NVMe (for metadata)
  • 10 Gbps NIC
  • 2 x Xeon Silver 4116, 384 GB RAM
  • Possible hardware issue

AGLT2

  • PowerEdge R740XD
  • 8 x 8 TB 7.2k RPM NL-SAS
  • 1 x Dell 1.6 TB NVMe (for metadata)
  • 1 Gbps NIC connected (2 x 10 Gbps waiting for a switch)
  • 2 x Xeon Silver 4116, 384 GB RAM


9 of 34

Monitoring

XCache reporter

Panda job reports

Rucio Traces

K8s monitoring



13 of 34

XCache findings

  • Stability
    • Delivered: we had zero instances of the XCache server crashing.
  • Correctness
    • Still not sure
      • At SC18, checksums were not checked.
      • At MWT2, incorrect data appeared in the cache at higher loads, but this could be a hardware issue.
      • At AGLT2, still not enough load; we will know next week.
  • Throughput (from reasonable hardware)
    • Not there yet
      • At SC18 we maxed out at ~2 GB/s even with SSDs.
      • At MWT2 we are maxing out the 10 Gbps NIC; we will need a “real” XCache node to test higher.


14 of 34

SLATE findings

Very easy to deploy, redeploy, or remove an application instance. I basically issue three commands and I’m done in 30 seconds flat:

./slate instance list

./slate instance delete <instance name>

./slate app install --vo atlas-xcache --cluster uchicago-prod --conf MWT2.yaml xcache

I was able to completely manage the AGLT2 XCache without Shawn’s or Wenjing’s involvement, apart from the initial SLATE and hardware setup.

SLATE instance monitoring needs a bit of improvement (logs, per-container/pod instance metrics and events), but even without it, it is completely usable for production-level applications.

15 of 34

Next steps

  • Get proper hardware in place at MWT2 and AGLT2.
  • Get SLATE deployed at all US T2s.
  • Get XCache accounting in order.
  • Retest correctness under heavy load.
  • Remeasure performance and find the optimal configuration.
  • Simulate XCaches in a “Lake-aware” job scheduling model.
  • Test one site in diskless mode.


16 of 34

Reminder

slides


17 of 34

XCache simulation

All simulations:

  • Full file caching
  • Low water mark 85%
  • High water mark 95%
  • LRU - least recently used cached files are expunged first (a minimal sketch of this logic follows below)
  • Most numbers and plots can be found here.
  • All the code to reproduce the numbers can be found here. Currently ECDF and LRZ people are using it, and Chris Weaver (UC) is checking everything independently. Still, new eyes are always welcome.
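
To make the cache model concrete, here is a minimal sketch of the logic listed above (full-file caching, LRU eviction, cleanup from the 95% high water mark down to the 85% low water mark). Class and method names are illustrative only and are not taken from the actual simulation code:

```python
# Minimal sketch of the simulated cache behaviour (assumptions as listed above).
from collections import OrderedDict

class LRUFileCache:
    def __init__(self, capacity_bytes, low_wm=0.85, high_wm=0.95):
        self.capacity = capacity_bytes
        self.low_wm, self.high_wm = low_wm, high_wm
        self.files = OrderedDict()   # filename -> size, ordered by recency
        self.used = 0
        self.requests = self.hits = 0
        self.bytes_requested = self.bytes_from_cache = 0

    def access(self, filename, size):
        """Register one file access; return True on a cache hit."""
        self.requests += 1
        self.bytes_requested += size
        if filename in self.files:              # hit: refresh recency
            self.files.move_to_end(filename)
            self.hits += 1
            self.bytes_from_cache += size
            return True
        self.files[filename] = size             # miss: cache the whole file
        self.used += size
        if self.used > self.high_wm * self.capacity:
            self._cleanup()
        return False

    def _cleanup(self):
        # Expunge least recently used files until below the low water mark.
        while self.used > self.low_wm * self.capacity and self.files:
            _, size = self.files.popitem(last=False)
            self.used -= size
```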


18 of 34

Access overlaps between sites


PRODUCTION

site1   site2   % of site1 unique files   % of site2 unique files   files accessed on both sites
MWT2    AGLT2    2.82%                    11.21%                     67538
MWT2    NET2     1.75%                     4.03%                     42032
MWT2    SWT2     5.74%                     9.77%                    137555
MWT2    BNL      9.62%                     7.00%                    230618
AGLT2   NET2     5.33%                     3.08%                     32117
AGLT2   SWT2     6.75%                     2.89%                     40668
AGLT2   BNL     15.80%                     2.89%                     95171
NET2    SWT2     2.06%                     1.52%                     21423
NET2    BNL      7.14%                     2.26%                     74427
SWT2    BNL      8.96%                     3.83%                    126087

ANALYSIS

site1   site2   % of site1 unique files   % of site2 unique files   files accessed on both sites
MWT2    AGLT2    0.79%                     1.97%                     26761
MWT2    NET2     0.58%                     2.26%                     19756
MWT2    SWT2     0.68%                     1.27%                     22974
MWT2    BNL      2.99%                     1.32%                    101525
AGLT2   NET2     1.00%                     1.56%                     13605
AGLT2   SWT2     1.09%                     0.82%                     14854
AGLT2   BNL      3.33%                     0.59%                     45354
NET2    SWT2     1.80%                     0.87%                     15727
NET2    BNL      6.87%                     0.78%                     60137
SWT2    BNL      3.14%                     0.74%                     56809
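
To read these tables: the two percentage columns are the shared-file count expressed as a fraction of each site's unique files. A quick check against the production MWT2/AGLT2 row, using only numbers from the table above (my own arithmetic):

```python
# Production MWT2/AGLT2 row: 67538 shared files, 2.82% of MWT2's and
# 11.21% of AGLT2's unique files, implying the per-site unique-file counts.
shared = 67538
pct_of_mwt2, pct_of_aglt2 = 0.0282, 0.1121

print(f"MWT2  unique files ~ {shared / pct_of_mwt2:,.0f}")   # ~2.4 million
print(f"AGLT2 unique files ~ {shared / pct_of_aglt2:,.0f}")  # ~0.6 million
```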

19 of 34

Let’s try to simulate a 2-layer cache - numbers


site       cleanups   avg. accesses   requests    cache hits   requested [TB]   delivered from cache [TB]   cache hits [%]   cache data [%]
xc_AGLT2   110        8.0             2350045     1305904      4553.6           2654.9                      55.57            58.30
xc_BNL     149        3.3             9926097     5113342      18178.2          10600.2                     51.51            58.31
xc_MWT2    301.5      5.9             8027849     4303754      13685.7          8720.7                      53.61            63.72
xc_NET2    124.75     14.5            3549113     2144678      7307.7           5166.3                      60.43            70.70
xc_SWT2    178        6.2             4275897     2233458      7542.2           4551.5                      52.23            60.35
xc_Int2    237.8      1.1             13027865    731289       19573.8          1072.1                      5.61             5.48
ORIGIN     -          -               12296576    12296576     18501.7          18501.7                     -                -

Production Inputs, August and September
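
The last two columns are derived from the raw counts and volumes; for example, for the xc_MWT2 row above:

```python
# xc_MWT2, production inputs: derive the percentage columns from the raw values.
requests, cache_hits = 8027849, 4303754
requested_tb, delivered_tb = 13685.7, 8720.7

print(f"cache hits [%]: {cache_hits / requests:.2%}")        # ~53.61%
print(f"cache data [%]: {delivered_tb / requested_tb:.2%}")  # ~63.72%
```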

20 of 34

Let’s try to simulate a 2-layer cache - plots


Production Inputs, August and September

21 of 34

Let’s try to simulate a 2-layer cache - numbers


Analysis Inputs, August and September

site       cleanups   avg. accesses   requests    cache hits   requested [TB]   delivered from cache [TB]   cache hits [%]   cache data [%]
xc_AGLT2   70.75      1.8             2514129     938976       2336.8           1060.0                      37.35            45.36
xc_BNL     174        1.9             15487492    5437455      14821.1          6038.6                      35.11            40.74
xc_Int2    182.8      1.0             18878388    309758       14832.3          459.4                       1.64             3.10
xc_MWT2    162.25     2.7             6868473     2695981      5099.0           2361.3                      39.25            46.31
xc_NET2    26         2.0             1588378     632108       1005.4           446.3                       39.80            44.39
xc_SWT2    83.5       1.9             3138866     1014430      2571.2           1095.1                      32.32            42.59
ORIGIN     -          -               18568630    18568630     14372.9          14372.9                     -                -

22 of 34

Let’s try to simulate a 2-layer cache - plots


Analysis Inputs, August and September

23 of 34

Adding 350 TB of cache to the system


  • 4x10: MWT2, AGLT2, NET2, SWT2
  • 4x30: BNL
  • 5x100: Int2

  • 9x10: MWT2, AGLT2, NET2, SWT2
  • 9x30: BNL
  • 5x30: Int2

24 of 34

What kind of traffic would XCache generate?

We assume each client consumes one third of a 1 Gbps link (43 MB/s). This is an upper limit on what our codes can read (and decompress), and roughly twice the average read speed measured from LSM monitoring. We assumed the enlarged T2 caches.
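
As a worked example of how this assumption translates into node throughput (the client count below is illustrative, not a measured number):

```python
# Back-of-the-envelope egress for one XCache node if every client reads at
# 43 MB/s (one third of a 1 Gbps link, as assumed above).
MB_PER_CLIENT_PER_S = 43
clients = 100                 # hypothetical number of concurrent readers

egress_gbps = clients * MB_PER_CLIENT_PER_S * 8 / 1000   # MB/s -> Gbps
print(f"{clients} clients -> ~{egress_gbps:.0f} Gbps")    # ~34 Gbps
```

At roughly 100 concurrent readers this already approaches the 40 Gbps-class NIC recommended for a cache node later in this talk.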


25 of 34

Simulated traffic 1

26 of 34

Simulated traffic 2

27 of 34

Simulated traffic 3

28 of 34

Simulated traffic 4

29 of 34

MWT2 traffic

All our dCache servers.

Some admixture of OSG jobs...


~5 Gbps ingress (FTS ingress + job output)

~25 Gbps egress (FTS egress + job input)

30 of 34

MWT2 current FTS


Egress: 2.6 Gbps

Ingress: 3 Gbps

31 of 34

Containerized XCache service basics

XCache needs:

  • Large enough local disk (not NAS)
  • DTN-equivalent network (10 Gbps minimum, 40 Gbps recommended)
  • Open port 1094
  • A robot certificate to authenticate itself to the ATLAS data service
  • A handful of config variables (illustrated in the sketch below)
  • User-parsable names, e.g. xcache.mwt2.org, or xcache.mwt2.slate, xcache.aglt2.slate
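
Purely as an illustration of these requirements, here is what that handful of settings might look like; the key names are hypothetical and do not come from the actual SLATE chart (the real values file is the MWT2.yaml mentioned earlier):

```python
# Hypothetical XCache configuration values, loaded the way a deployment might
# consume a small YAML values file. All key names are illustrative only.
import yaml  # PyYAML

example_values = """
instance: xcache.mwt2.org           # user-parsable name
port: 1094                          # must be reachable from the WAN
robotCertSecret: xcache-robot-cert  # robot certificate for ATLAS data access
cacheDisks: [/data1, /data2]        # large local disks, not NAS
"""

conf = yaml.safe_load(example_values)
for key in ("instance", "port", "robotCertSecret", "cacheDisks"):
    assert key in conf, f"missing required setting: {key}"
print(f"Deploying {conf['instance']} on port {conf['port']}")
```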


32 of 34

XCache-related information

ATLAS information

General Information


33 of 34

Why?

We have a subscription, defined more than 3 years ago, to keep a complete replica of all DAODs at the US sites:

https://rucio-ui.cern.ch/subscription?name=DAODs%20to%20US%20T2%20DATADISK&account=ddmadmin https://its.cern.ch/jira/browse/ATLDDMOPS-5089

During the last days we accumulated a huge backlog of files to transfer at a few sites:

Nbfiles   Bytes             RSE
7705227   270458200008683   SLACXRD_DATADISK
542134    127768052728015   MWT2_DATADISK
498550    161189127282917   NET2_DATADISK
466033    162974407287848   AGLT2_DATADISK

These transfers compete with other transfers like Production Input or T0 export.

Is this subscription still needed, taking into account that a huge volume of DAODs are never touched? Or can we disable it and think about a smarter placement (e.g. only send an extra copy of the DAODs that were accessed at least once)?

Cheers,

Cedric


34 of 34

CAVEAT EMPTOR

Cache performance will depend entirely on how we use the caches:

  • Will Panda be aware of what’s in the cache?
  • Caching for remote access vs. an SSD-backed cache for quick access to frequently used files?
  • Scheduling to cache-only sites?
  • Separate caches for production and analysis queues?
  • Will popularity-based placement be switched off?
  • How will the network cost be calculated?
  • Will we still have pre-placed datasets?
  • Will we try to separate different input workflows?
