XCache on RU-DataLake (dedicated server and distributed nodes): deployment and testing with ATLAS tools
Andrey Zarochentsev, Aleksandr Alekseev, Stephane Jezequel,
Andrey Kiryanov, Alexei Klimentov, Tatiana Korchuganova, Danila Oleynik
Russian DataLake RnD
This report is about Distributed XCache on nodes
Direct Access without cache
Services for distributed xCache on nodes:
Services on all nodes: cmsd (server), xrootd (server)
Services on one node: cmsd (manager), xrootd (manager)
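The service layout above can be sketched as a small mapping from node to daemon roles. This is only an illustration of the stated split (server roles everywhere, manager roles on one node); the function name and the host names `wn01` to `wn03` are hypothetical and not taken from the testbed.

```python
from typing import Dict, List, Tuple

Service = Tuple[str, str]  # (daemon, role)


def xcache_service_layout(nodes: List[str], manager_node: str) -> Dict[str, List[Service]]:
    """Return the daemons and roles expected on each node.

    Every node runs cmsd and xrootd in the server role; the single
    manager node additionally runs both daemons in the manager role.
    """
    if manager_node not in nodes:
        raise ValueError(f"manager node {manager_node!r} is not in the node list")

    layout: Dict[str, List[Service]] = {}
    for node in nodes:
        services: List[Service] = [("cmsd", "server"), ("xrootd", "server")]
        if node == manager_node:
            services += [("cmsd", "manager"), ("xrootd", "manager")]
        layout[node] = services
    return layout


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for node, services in xcache_service_layout(["wn01", "wn02", "wn03"], "wn01").items():
        print(node, ", ".join(f"{daemon} ({role})" for daemon, role in services))
```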
[Figure: two layouts of worker nodes (WN) and xCache instances, labelled "Distributed xCache on nodes" and "Dedicated xCache"]
Technical characteristics
| | PNPI ← | JINR ← |
| PNPI → | 10 Gbps | 3 Gbps, ~5.3 s latency |
| JINR → | 1.4 Gbps, ~8.4 s latency | |
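To give a feel for what the bandwidths in the table mean for stage-in, the sketch below computes an idealised transfer time for one input file. The 2 GB file size is a hypothetical value, not a number from the tests, and the estimate ignores latency, protocol overhead and concurrent transfers, so it is a lower bound only.

```python
# Bandwidths copied from the table above, in Gbps, keyed by row and column label.
LINK_BANDWIDTH_GBPS = {
    "PNPI → / PNPI ←": 10.0,
    "PNPI → / JINR ←": 3.0,
    "JINR → / PNPI ←": 1.4,
}


def ideal_transfer_seconds(size_bytes: float, bandwidth_gbps: float) -> float:
    """Lower-bound transfer time: payload bits divided by the link rate."""
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes * 8 / (bandwidth_gbps * 1e9)


if __name__ == "__main__":
    file_size_bytes = 2e9  # hypothetical 2 GB input file
    for link, gbps in LINK_BANDWIDTH_GBPS.items():
        print(f"{link}: {ideal_transfer_seconds(file_size_bytes, gbps):.1f} s")
```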
HammerCloud settings for tests
1st test: Distributed xCache on nodes vs Dedicated XCache vs Direct Access (tests run simultaneously). Test copy2scratch.
Histograms and corresponding Gaussian curves
Download input file time*
totaltime*
payload (Athena) running time*
* Walltime metrics reported by Pilot: time to fetch the job, set up, stage in, run the payload, and stage out. totaltime is their sum.
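The relation between totaltime and the per-step walltimes can be written down as a minimal sketch. The field names below are hypothetical stand-ins for the five Pilot steps listed in the footnote, and the numbers in the example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PilotTimings:
    """Per-step walltimes in seconds (field names are illustrative)."""
    get_job: float
    setup: float
    stage_in: float
    payload: float
    stage_out: float

    @property
    def total(self) -> float:
        # totaltime is the sum of the five walltime components.
        return self.get_job + self.setup + self.stage_in + self.payload + self.stage_out


if __name__ == "__main__":
    # Made-up values, not measurements from the tests.
    timings = PilotTimings(get_job=5, setup=10, stage_in=30, payload=400, stage_out=15)
    print(f"totaltime = {timings.total} s")
```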
1st test: Distributed xCache on nodes vs Dedicated XCache vs Direct Access (tests run simultaneously). Test copy2scratch.
| | Direct access PNPI-TEST2 | Distributed XCache PNPI_XCACHE-NODE | Single XCache server PNPI_XCACHE-TEST |
| Download input file time (s) | μ=137, σ=79 | μ=30, σ=11 | μ=32, σ=13 |
| Athena running time (s) | μ=321, σ=33 | μ=411, σ=31 | μ=335, σ=41 |
| totaltime (s) | μ=472, σ=81 | μ=460, σ=43 | μ=384, σ=50 |
| Number of completed tests | 121 | 107 | 120 |
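One way to read the table is to express each mean relative to direct access. The sketch below does this arithmetic on the means copied from the table; the short metric keys and the choice of PNPI-TEST2 as baseline are assumptions made for the illustration.

```python
from typing import Dict

# Means from the 1st test table (assumed to be seconds).
FIRST_TEST_MEANS: Dict[str, Dict[str, float]] = {
    "PNPI-TEST2": {"download": 137, "athena": 321, "total": 472},
    "PNPI_XCACHE-NODE": {"download": 30, "athena": 411, "total": 460},
    "PNPI_XCACHE-TEST": {"download": 32, "athena": 335, "total": 384},
}


def relative_change(means: Dict[str, Dict[str, float]],
                    baseline: str = "PNPI-TEST2") -> Dict[str, Dict[str, float]]:
    """Fractional change of each metric with respect to the baseline site."""
    base = means[baseline]
    return {
        site: {metric: (value - base[metric]) / base[metric]
               for metric, value in metrics.items()}
        for site, metrics in means.items()
        if site != baseline
    }


if __name__ == "__main__":
    for site, changes in relative_change(FIRST_TEST_MEANS).items():
        print(site, {metric: f"{change:+.1%}" for metric, change in changes.items()})
```

On these means, both cache setups cut the mean download time by roughly 77 to 78%, while the mean Athena running time on the distributed cache nodes is about 28% higher than with direct access. As a result the mean totaltime drops by about 3% for the distributed setup and about 19% for the single server.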
2nd test: Distributed XCache on nodes vs Dedicated XCache vs Direct Access (tests run separately). Test copy2scratch.
Histograms and corresponding Gaussian curves
Download input file time*
totaltime*
payload (Athena) running time*
* Walltime metrics reported by Pilot: time to fetch the job, set up, stage in, run the payload, and stage out. totaltime is their sum.
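The Gaussian curves drawn over the histograms are characterised by a mean μ and a standard deviation σ. A minimal sketch of obtaining such a curve from a list of walltimes is shown below, using the standard-library `statistics.NormalDist`; it estimates μ and σ directly from the sample, which may differ from the fitting procedure actually used for the plots, and the sample values are made up.

```python
from statistics import NormalDist
from typing import List, Sequence, Tuple


def fit_gaussian(samples: Sequence[float]) -> NormalDist:
    """Estimate mu and sigma from the sample (sample standard deviation)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return NormalDist.from_samples(samples)


def curve_points(dist: NormalDist, lo: float, hi: float,
                 steps: int = 50) -> List[Tuple[float, float]]:
    """Evaluate the probability density on an evenly spaced grid."""
    width = (hi - lo) / steps
    return [(lo + i * width, dist.pdf(lo + i * width)) for i in range(steps + 1)]


if __name__ == "__main__":
    walltimes = [22.0, 27.0, 31.0, 29.0, 35.0, 26.0]  # made-up values
    dist = fit_gaussian(walltimes)
    print(f"mu={dist.mean:.1f} sigma={dist.stdev:.1f}")
    for x, density in curve_points(dist, 10, 50, steps=8):
        print(f"{x:5.1f} {density:.4f}")
```

To overlay the curve on a histogram of counts, each density value would be multiplied by the number of samples and the bin width.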
Zabbix monitoring
[Figure: Zabbix monitoring plots for PNPI_XCACHE-NODE, PNPI_XCACHE-TEST and PNPI-TEST2]
Kibana XCache monitoring
Monitoring xCache with details for one of the files saved on node v010.
2nd test: Distributed xCache on nodes vs Dedicated XCache vs Direct Access (tests run separately)
| | PNPI-TEST2 | PNPI_XCACHE-NODE | PNPI_XCACHE-TEST |
| Download input file time (s) | μ=290, σ=677 | μ=114, σ=587 | μ=27, σ=7 |
| Athena running time (s) | μ=309, σ=19 | μ=340, σ=82 | μ=311, σ=21 |
| totaltime (s) | μ=616, σ=678 | μ=469, σ=607 | μ=354, σ=28 |
| Number of completed tests | 771 | 528 | 1109 |
| CPU user time (%) | μ=30, σ=14 | μ=31, σ=12 | μ=40, σ=16 |
| Incoming network (Mb/s) | μ=46, σ=31 | μ=109, σ=34 | μ=74, σ=28 |
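In this table some download-time spreads exceed the corresponding means. A quick way to see this is the ratio σ/μ (coefficient of variation), sketched below on the download-time values from the table. The threshold of 1.0 used to flag a site is an assumption chosen for illustration, not a criterion from the tests.

```python
from typing import Dict, List, Tuple

# (mu, sigma) of the download input file time in seconds, from the 2nd test table.
SECOND_TEST_DOWNLOAD: Dict[str, Tuple[float, float]] = {
    "PNPI-TEST2": (290, 677),
    "PNPI_XCACHE-NODE": (114, 587),
    "PNPI_XCACHE-TEST": (27, 7),
}


def coefficient_of_variation(mu: float, sigma: float) -> float:
    """Spread relative to the mean; values above 1 mean sigma exceeds mu."""
    if mu <= 0:
        raise ValueError("mean must be positive")
    return sigma / mu


def widely_spread(stats: Dict[str, Tuple[float, float]],
                  threshold: float = 1.0) -> List[str]:
    """Sites whose sigma/mu ratio exceeds the (assumed) threshold."""
    return sorted(site for site, (mu, sigma) in stats.items()
                  if coefficient_of_variation(mu, sigma) > threshold)


if __name__ == "__main__":
    for site, (mu, sigma) in SECOND_TEST_DOWNLOAD.items():
        print(f"{site}: sigma/mu = {coefficient_of_variation(mu, sigma):.2f}")
    print("flagged:", widely_spread(SECOND_TEST_DOWNLOAD))
```

For a non-negative quantity such as a download time, a ratio well above 1 (about 2.3 for PNPI-TEST2 and about 5.1 for PNPI_XCACHE-NODE here, against about 0.26 for PNPI_XCACHE-TEST) suggests a small number of very long transfers, so the mean alone may not describe the typical job well.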
Results and near-term plans
Monitoring
RU-DataLake HC tests monitoring
Data Lake infrastructure monitoring based on the ELK stack
Thanks
Backup
1st test: Distributed xCache on nodes vs Dedicated xCache vs Direct Access (tests run simultaneously)
2nd test: Distributed xCache on nodes vs Dedicated xCache vs Direct Access (tests run separately)
Russian DataLake 2019 (phase 1)
Reading through xCache
Direct writing
[Diagram: JINR SE (dCache) and four sites, each with a site CE and an xCache]
Russian Data Lake Phase 2 (2020–2021)
Reading through xCache
Writing to closest pool
Replication on demand
[Diagram: JINR SE (EOS mgm) and four sites, each with a site CE, an xCache and EOS pools]
Russian Data Lake testbed in 2020
Testbed changes