1 of 34

XCache: Deployment, Integration, Performance

Ilija Vukotic

December 4, 2018

WBS 2.3.5.4 Intelligent Data Delivery

2 of 34

Current situation

  • Several different groups are involved in xrootd-based caching
    • ECDF (WN cache, diskless-site simulation) (link)
    • SLAC (Wei)
    • Italian cloud (simulation, planning to convert all sites to diskless) (presentation this morning)
    • German cloud (simulation, understanding the working data set)
    • MWT2, AGLT2 (simulation, deployed caches in hospital queues)
    • French cloud (should start simulation soon) (link)
    • BNL (as we learned yesterday)
  • Non-xrootd-based caching
    • David Cameron - ARC (in production) (link)
    • Irvin Umaña Chacon, David Smith - Prague (simulation) (link)
    • DPM/Dynafed (link1, link2)


3 of 34

Current understanding of data reuse

Production inputs are slightly more cacheable (52% of accesses and 67% of data volume) than analysis inputs (35% of accesses and 37% of data volume).

Different file types have very different access patterns (e.g. HITS, EVNT, and payload files are very cacheable; DAODs, panda*, and AODs less so).

As expected, it rarely happens that the same file is accessed at two different sites.

Claim: even a cache of 100 TB per site would be sufficient to deliver roughly half of the accesses and data volume.


4 of 34

Current understanding of US scale cache

A multilayer cache would not significantly help.

The throughput generated between sites would be reasonable.


5 of 34

Can XCache deliver what is needed?

Can SLATE deliver what is needed?

  • Stability
  • Correctness
  • Throughput (from reasonable hardware)

  • Servicing the application without contacting a sysadmin
    • Deploy, Update, Remove
    • Monitoring, Inspection, Debugging


Try it and find out!

6 of 34

XCache TODO list from November 7th

  • SC2018 demo (findings)
    • Running an XCache server (100 Gbps NIC, NVMe storage)
    • Clients running in Azure, Google, and AWS
    • Data coming from US sites
    • Pre-tested last week to ~20 Gbps
  • Stress-test XCache nodes and determine the optimal configuration
  • Get all US sites to deploy a SLATE node (MWT2 and AGLT2 done)
  • Get additional data reported in Rucio traces (pandaID)
  • Simulate XCache with separation of data formats (at US scale)
  • Get “hospital” queues configured and working
  • AGIS auto-(de)activation.


(Four of these items were marked DONE on the slide.)

7 of 34

Hospital queues

Queues that get jobs not accepted anywhere else. They are configured to use remote SEs for input, but all inputs go through the local XCache. Output is handled the same way as for regular jobs.

Set up at MWT2 (production and analysis) and AGLT2 (production). Thanks Judith and Wenjing!

We “mounted” them against the US DATADISKs. Not without issues:

  • The BNL endpoint would randomly stop authenticating accesses.
  • The NET2 endpoint stopped working, and it happened to be the only instance. REMINDER: all sites are obliged to provide a working xrootd endpoint.


8 of 34

XCaches for Hospital queues

MWT2

  • PowerEdge R740XD
  • 8 x 8 TB 7.2k RPM NL-SAS
  • 1 x Dell 1.6 TB NVMe (for metadata)
  • 10 Gbps NIC
  • 2 x Xeon Silver 4116, 384 GB RAM
  • Possible hardware issue

AGLT2

  • PowerEdge R740XD
  • 8 x 8 TB 7.2k RPM NL-SAS
  • 1 x Dell 1.6 TB NVMe (for metadata)
  • 1 Gbps NIC connected (2 x 10 Gbps waiting for a switch)
  • 2 x Xeon Silver 4116, 384 GB RAM


9 of 34

Monitoring

XCache reporter

Panda job reports

Rucio Traces

K8s monitoring



13 of 34

XCache findings

  • Stability
    • Delivered: we had zero instances of the XCache server crashing.
  • Correctness
    • Still not sure
      • At SC18, checksums were not checked.
      • At MWT2, incorrect data appeared in the cache at higher loads, but this could be a hardware issue.
      • At AGLT2, still not enough load; we will know next week.
  • Throughput (from reasonable hardware)
    • Not there yet
      • At SC18 we maxed out at ~2 GB/s even with SSDs.
      • At MWT2 we are maxing out the 10 Gbps NIC; we will need a “real” XCache node to test higher.


14 of 34

SLATE findings

Very easy to deploy, redeploy, or remove an application instance. I basically issue three commands and I’m done in 30 seconds flat:

./slate instance list

./slate instance delete <instance name>

./slate app install --vo atlas-xcache --cluster uchicago-prod --conf MWT2.yaml xcache

I was able to completely manage the AGLT2 XCache without Shawn’s or Wenjing’s involvement, apart from the initial SLATE and hardware setup.

SLATE instance monitoring needs a bit of improvement (logs, per-container/pod instance metrics and events), but even without it, it is completely usable for production-level applications.

15 of 34

Next steps

  • Get proper hardware in place at MWT2 and AGLT2.
  • Get SLATE deployed at all US T2s.
  • Get XCache accounting in order.
  • Retest correctness under heavy load.
  • Remeasure performance and find the optimal configuration.
  • Simulate XCaches in a “Lake-aware” job scheduling model.
  • Test one site in diskless mode.


16 of 34

Reminder

slides


17 of 34

XCache simulation

All simulations:

  • Full file caching
  • Low water mark 85%
  • High water mark 95%
  • LRU - least recently used cached files are expunged first (a minimal sketch of this logic follows below)
  • Most numbers and plots can be found here.
  • All the code to reproduce the numbers can be found here. Currently ECDF and LRZ people are using it, and Chris Weaver (UC) is checking everything independently. Still, new eyes are always welcome.
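
To make the cache model concrete, here is a minimal sketch of the logic listed above (full-file caching, LRU eviction, cleanup from the 95% high water mark down to the 85% low water mark). Class and method names are illustrative only and are not taken from the actual simulation code:

```python
# Minimal sketch of the simulated cache behaviour (assumptions as listed above).
from collections import OrderedDict

class LRUFileCache:
    def __init__(self, capacity_bytes, low_wm=0.85, high_wm=0.95):
        self.capacity = capacity_bytes
        self.low_wm, self.high_wm = low_wm, high_wm
        self.files = OrderedDict()   # filename -> size, ordered by recency
        self.used = 0
        self.requests = self.hits = 0
        self.bytes_requested = self.bytes_from_cache = 0

    def access(self, filename, size):
        """Register one file access; return True on a cache hit."""
        self.requests += 1
        self.bytes_requested += size
        if filename in self.files:              # hit: refresh recency
            self.files.move_to_end(filename)
            self.hits += 1
            self.bytes_from_cache += size
            return True
        self.files[filename] = size             # miss: cache the whole file
        self.used += size
        if self.used > self.high_wm * self.capacity:
            self._cleanup()
        return False

    def _cleanup(self):
        # Expunge least recently used files until below the low water mark.
        while self.used > self.low_wm * self.capacity and self.files:
            _, size = self.files.popitem(last=False)
            self.used -= size
```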


18 of 34

Access overlaps between sites


PRODUCTION

site1   site2   % of site1 unique files   % of site2 unique files   files accessed on both sites
MWT2    AGLT2    2.82%                    11.21%                     67538
MWT2    NET2     1.75%                     4.03%                     42032
MWT2    SWT2     5.74%                     9.77%                    137555
MWT2    BNL      9.62%                     7.00%                    230618
AGLT2   NET2     5.33%                     3.08%                     32117
AGLT2   SWT2     6.75%                     2.89%                     40668
AGLT2   BNL     15.80%                     2.89%                     95171
NET2    SWT2     2.06%                     1.52%                     21423
NET2    BNL      7.14%                     2.26%                     74427
SWT2    BNL      8.96%                     3.83%                    126087

ANALYSIS

site1   site2   % of site1 unique files   % of site2 unique files   files accessed on both sites
MWT2    AGLT2    0.79%                     1.97%                     26761
MWT2    NET2     0.58%                     2.26%                     19756
MWT2    SWT2     0.68%                     1.27%                     22974
MWT2    BNL      2.99%                     1.32%                    101525
AGLT2   NET2     1.00%                     1.56%                     13605
AGLT2   SWT2     1.09%                     0.82%                     14854
AGLT2   BNL      3.33%                     0.59%                     45354
NET2    SWT2     1.80%                     0.87%                     15727
NET2    BNL      6.87%                     0.78%                     60137
SWT2    BNL      3.14%                     0.74%                     56809
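
To read these tables: the two percentage columns are the shared-file count expressed as a fraction of each site's unique files. A quick check against the production MWT2/AGLT2 row, using only numbers from the table above (my own arithmetic):

```python
# Production MWT2/AGLT2 row: 67538 shared files, 2.82% of MWT2's and
# 11.21% of AGLT2's unique files, implying the per-site unique-file counts.
shared = 67538
pct_of_mwt2, pct_of_aglt2 = 0.0282, 0.1121

print(f"MWT2  unique files ~ {shared / pct_of_mwt2:,.0f}")   # ~2.4 million
print(f"AGLT2 unique files ~ {shared / pct_of_aglt2:,.0f}")  # ~0.6 million
```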

19 of 34

Let’s try to simulate a 2-layer cache - numbers


site       cleanups   avg. accesses   requests    cache hits   requested [TB]   delivered from cache [TB]   cache hits [%]   cache data [%]
xc_AGLT2   110        8.0             2350045     1305904      4553.6           2654.9                      55.57            58.30
xc_BNL     149        3.3             9926097     5113342      18178.2          10600.2                     51.51            58.31
xc_MWT2    301.5      5.9             8027849     4303754      13685.7          8720.7                      53.61            63.72
xc_NET2    124.75     14.5            3549113     2144678      7307.7           5166.3                      60.43            70.70
xc_SWT2    178        6.2             4275897     2233458      7542.2           4551.5                      52.23            60.35
xc_Int2    237.8      1.1             13027865    731289       19573.8          1072.1                      5.61             5.48
ORIGIN     -          -               12296576    12296576     18501.7          18501.7                     -                -

Production Inputs, August and September
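
The last two columns are derived from the raw counts and volumes; for example, for the xc_MWT2 row above:

```python
# xc_MWT2, production inputs: derive the percentage columns from the raw values.
requests, cache_hits = 8027849, 4303754
requested_tb, delivered_tb = 13685.7, 8720.7

print(f"cache hits [%]: {cache_hits / requests:.2%}")        # ~53.61%
print(f"cache data [%]: {delivered_tb / requested_tb:.2%}")  # ~63.72%
```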

20 of 34

Let’s try to simulate a 2-layer cache - plots


Production Inputs, August and September

21 of 34

Let’s try to simulate a 2-layer cache - numbers


Analysis Inputs, August and September

site       cleanups   avg. accesses   requests    cache hits   requested [TB]   delivered from cache [TB]   cache hits [%]   cache data [%]
xc_AGLT2   70.75      1.8             2514129     938976       2336.8           1060.0                      37.35            45.36
xc_BNL     174        1.9             15487492    5437455      14821.1          6038.6                      35.11            40.74
xc_Int2    182.8      1.0             18878388    309758       14832.3          459.4                       1.64             3.10
xc_MWT2    162.25     2.7             6868473     2695981      5099.0           2361.3                      39.25            46.31
xc_NET2    26         2.0             1588378     632108       1005.4           446.3                       39.80            44.39
xc_SWT2    83.5       1.9             3138866     1014430      2571.2           1095.1                      32.32            42.59
ORIGIN     -          -               18568630    18568630     14372.9          14372.9                     -                -

22 of 34

Let’s try to simulate a 2-layer cache - plots


Analysis Inputs, August and September

23 of 34

Adding 350 TB of cache to the system


  • 4x10: MWT2, AGLT2, NET2, SWT2
  • 4x30: BNL
  • 5x100: Int2

  • 9x10: MWT2, AGLT2, NET2, SWT2
  • 9x30: BNL
  • 5x30: Int2

24 of 34

What kind of traffic would XCache generate?

We assume each client consumes one third of a 1 Gbps link (43 MB/s). This is an upper limit on what our codes can read (and decompress), and roughly twice the average read speed measured from LSM monitoring. We assumed the enlarged T2 caches.
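
As a worked example of how this assumption translates into node throughput (the client count below is illustrative, not a measured number):

```python
# Back-of-the-envelope egress for one XCache node if every client reads at
# 43 MB/s (one third of a 1 Gbps link, as assumed above).
MB_PER_CLIENT_PER_S = 43
clients = 100                 # hypothetical number of concurrent readers

egress_gbps = clients * MB_PER_CLIENT_PER_S * 8 / 1000   # MB/s -> Gbps
print(f"{clients} clients -> ~{egress_gbps:.0f} Gbps")    # ~34 Gbps
```

At roughly 100 concurrent readers this already approaches the 40 Gbps-class NIC recommended for a cache node later in this talk.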


25 of 34

Simulated traffic 1

26 of 34

Simulated traffic 2

27 of 34

Simulated traffic 3

28 of 34

Simulated traffic 4

29 of 34

MWT2 traffic

All our dCache servers.

Some admixture of OSG jobs...


~5 Gbps ingress (FTS ingress + job output)

~25 Gbps egress (FTS egress + job input)

30 of 34

MWT2 current FTS


Egress: 2.6 Gbps

Ingress: 3 Gbps

31 of 34

Containerized XCache service basics

XCache needs:

  • Large enough local disk (not NAS)
  • DTN-equivalent network (10 Gbps minimum, 40 Gbps recommended)
  • Open port 1094
  • A robot certificate to authenticate itself to the ATLAS data service
  • A handful of config variables (illustrated in the sketch below)
  • User-parsable names, e.g. xcache.mwt2.org, or xcache.mwt2.slate, xcache.aglt2.slate
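
Purely as an illustration of these requirements, here is what that handful of settings might look like; the key names are hypothetical and do not come from the actual SLATE chart (the real values file is the MWT2.yaml mentioned earlier):

```python
# Hypothetical XCache configuration values, loaded the way a deployment might
# consume a small YAML values file. All key names are illustrative only.
import yaml  # PyYAML

example_values = """
instance: xcache.mwt2.org           # user-parsable name
port: 1094                          # must be reachable from the WAN
robotCertSecret: xcache-robot-cert  # robot certificate for ATLAS data access
cacheDisks: [/data1, /data2]        # large local disks, not NAS
"""

conf = yaml.safe_load(example_values)
for key in ("instance", "port", "robotCertSecret", "cacheDisks"):
    assert key in conf, f"missing required setting: {key}"
print(f"Deploying {conf['instance']} on port {conf['port']}")
```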


32 of 34

XCache-related information

ATLAS information

General Information


33 of 34

Why?

We have a subscription, defined more than 3 years ago, to keep a complete replica of all DAODs at the US sites:

https://rucio-ui.cern.ch/subscription?name=DAODs%20to%20US%20T2%20DATADISK&account=ddmadmin https://its.cern.ch/jira/browse/ATLDDMOPS-5089

During the last days we accumulated a huge backlog of files to transfer at a few sites:

Nbfiles   Bytes             RSE
7705227   270458200008683   SLACXRD_DATADISK
542134    127768052728015   MWT2_DATADISK
498550    161189127282917   NET2_DATADISK
466033    162974407287848   AGLT2_DATADISK

These transfers compete with other transfers like Production Input or T0 export.

Is this subscription still needed, taking into account that a huge volume of DAODs are never touched? Or can we disable it and think about a smarter placement (e.g. only send an extra copy of the DAODs that were accessed at least once)?

Cheers,

Cedric


34 of 34

CAVEAT EMPTOR

Cache performance will depend entirely on how we use the caches:

  • Will Panda be aware of what’s in the cache?
  • Caching for remote access vs. an SSD-backed cache for quick access to frequently used files?
  • Scheduling to cache-only sites?
  • Separate caches for production and analysis queues?
  • Will popularity-based placement be switched off?
  • How will the network cost be calculated?
  • Will we still have pre-placed datasets?
  • Will we try to separate different input workflows?
