ALICE report

costin.grigoras@cern.ch

Main topics

ALICE storage status today

Plans for Run 3

How do we get there?

Storage status today

10:1 average read:write ratio

5PB read per day (analysis for QM2017)

2/3 served by CERN::EOS

Fully federated storage space

Entire storage space visible to jobs and users via the central catalogue

Disk-only SEs

All job outputs and user files

58 instances

31.2PB used / 43PB total

0.91B files (36.3MB / file on average)

Tape-backed SEs

Only RAW data chunks

10 instances

40.6PB used (full)

39M files (1.1GB / file on average)

Mix of storage providers

46 Xrootd (22.8PB total disk / 14PB tape)

12 v3.* (2 v3.1*, 5 v3.2*, 5 v3.3*)

29 v4.* (4 v4.0*, 8 v4.1*, 3 v4.2*, 9 v4.3*, 5 v4.4*)

12 EOS (17PB total disk / no tape)

11 v0.3*, 1 v4.0*

6 dCache (2.3PB total disk / 2.6PB tape)

2 v2.10.*, 2 v2.13.*, 2 v3.*

2 CASTOR (24PB tape)

2 DPM (0.7PB total disk)

Xrootd protocol

Files are accessed directly via the Xrootd protocol

Jobs open input files remotely

Small files are downloaded with xrdcp

Jobs upload 2-3 copies on completion

Leveraging the excellent WAN capabilities of xrootd

Transfers (mainly RAW data) with

xrdcp --tpc / xrd3cp / xrdcp+xrdcp
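
To illustrate the transfer modes listed above, here is a minimal Python sketch of driving them from a script; it assumes the xrdcp client is available on the node, and the endpoints and paths are invented placeholders.

# Hedged sketch: drive a RAW-data transfer with the xrdcp client mentioned
# above. Source/destination URLs are made-up placeholders.
import subprocess
import tempfile

SRC = "root://source-se.example.org:1094//path/to/chunk.root"
DST = "root://destination-se.example.org:1094//path/to/chunk.root"

def transfer(src: str, dst: str) -> None:
    """Try a third-party copy first, then fall back to a two-hop xrdcp+xrdcp."""
    tpc = subprocess.run(["xrdcp", "--tpc", "only", src, dst])
    if tpc.returncode == 0:
        return
    # Two-hop fallback: pull the chunk locally, then push it to the destination.
    with tempfile.NamedTemporaryFile(suffix=".root") as tmp:
        subprocess.run(["xrdcp", "--force", src, tmp.name], check=True)
        subprocess.run(["xrdcp", "--force", tmp.name, dst], check=True)

transfer(SRC, DST)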

Data access policy

Jobs go to the data (98.2%)

Use external replicas for failover

Allow jobs to run anywhere if slots are not available at the desired site(s)

Controllable per analysis

Used to cut the analysis tails
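
A minimal sketch of the replica-selection side of this policy: replicas at the site where the job runs are tried first, with the external copies kept as failover. Only the ordering reflects the policy above; the site names, URLs and catalogue lookup are invented.

# Hedged sketch of the access policy above: local replicas first, external
# copies as failover. All names and URLs are invented.
def order_replicas(replicas, job_site):
    """replicas: list of (site, url) pairs returned by the central catalogue."""
    local = [url for site, url in replicas if site == job_site]
    external = [url for site, url in replicas if site != job_site]
    return local + external          # try the local copy first, then fail over

replicas = [("SITE_A", "root://se-a.example.org//01/10000/some-guid"),
            ("SITE_B", "root://se-b.example.org//02/20000/some-guid")]
print(order_replicas(replicas, job_site="SITE_B"))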

Central catalogue

Single index of all files on all storages

MySQL-backed AliEn File Catalogue

2.5B logical entries

One (powerful) DB master

1.5TB RAM, 2.4TB on-disk size

Replication slaves for hot standby / snapshots

2 Catalogue namespaces

Logical File Name (LFN)

Unix-like paths, e.g. “/alice/data/2015/…”

Metadata (object type, size, owner, checksum)

Unique identifier (GUID)

New versions of an LFN get a new GUID

Any number of physical copies (URLs) associated with GUIDs

One file in a nutshell

/alice/data/2016/LHC16n/000261088/raw/16000261088034.205.root

Metadata:

-rwxr-xr-x alidaq alidaq 264403565 Sep 09 22:10 2016

MD5: 0f24bce32446ea22840d188e035b11a9

GUID: 76CEBD12-76A0-11E6-9717-0D38A10ABEEF

Physical copy URLs (PFNs):

root://alice-tape-se.gridka.de:1094//10/33903/76cebd12-76a0-11e6-9717-0d38a10abeef

root://voalice10.cern.ch//castor/cern.ch/.../16000261088034.205.root
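
To make the relationship between the two namespaces concrete, here is a minimal sketch of the LFN -> GUID -> PFNs model, filled with the example file above. This is only an illustration of the data model, not the actual AliEn database schema.

# Hedged sketch of the LFN -> GUID -> PFNs relationship shown above.
# Not the AliEn schema, just an illustration of the data model.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogueEntry:
    lfn: str                      # Unix-like logical path
    size: int                     # bytes
    owner: str
    md5: str
    guid: str                     # version-1 UUID; a new LFN version gets a new GUID
    pfns: List[str] = field(default_factory=list)   # physical replicas (URLs)

entry = CatalogueEntry(
    lfn="/alice/data/2016/LHC16n/000261088/raw/16000261088034.205.root",
    size=264403565,
    owner="alidaq",
    md5="0f24bce32446ea22840d188e035b11a9",
    guid="76CEBD12-76A0-11E6-9717-0D38A10ABEEF",
    pfns=["root://alice-tape-se.gridka.de:1094//10/33903/76cebd12-76a0-11e6-9717-0d38a10abeef"],
)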

LFN namespace

Namespace split hierarchically into tables

2.5B entries in total (including folders)

1180 tables (largest 55M entries)

Tables are split as needed to keep them reasonably sized

and also preemptively:

per user account

per production / dataset

[Diagram: LFN namespace tree. /alice/ branches into data/ and sim/, then by year (2015/, 2016/), then by period (LHC15n/, LHC16n/, LHC15n1b, LHC16a1c), down to files such as 000261088/raw/16000261088034.205.root]
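
As an illustration of the prefix-based split, a minimal sketch of routing an LFN to its backing table by longest matching path prefix; the split points and table names below are invented.

# Hedged sketch: pick the catalogue table holding an LFN by longest matching
# path prefix. Split points and table names are invented.
TABLE_SPLITS = {
    "/alice/": "L0",
    "/alice/data/2016/": "L7",
    "/alice/data/2016/LHC16n/": "L42",
    "/alice/cern.ch/user/": "L900",
}

def table_for_lfn(lfn: str) -> str:
    best = max((p for p in TABLE_SPLITS if lfn.startswith(p)), key=len)
    return TABLE_SPLITS[best]

print(table_for_lfn("/alice/data/2016/LHC16n/000261088/raw/16000261088034.205.root"))  # -> L42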

GUID namespace

Version 1 UUIDs (timestamp + MAC address)

Split by time intervals

Dynamically, function of current chunk’s size

2.4B entries in 166 tables

Largest is 210M entries

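Because the GUIDs are version-1 UUIDs, the time interval (and hence the table) a GUID falls into can be derived from the GUID itself. A minimal sketch, with invented split boundaries and table names:

# Hedged sketch: route a GUID to its time-interval table. Version-1 UUIDs
# embed their creation time, so the interval can be derived from the GUID
# itself. The split boundaries and table names below are invented.
import uuid
from datetime import datetime, timedelta, timezone

UUID_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)   # start of UUID v1 time

def guid_time(guid: str) -> datetime:
    ticks = uuid.UUID(guid).time               # 100 ns intervals since 1582-10-15
    return UUID_EPOCH + timedelta(microseconds=ticks // 10)

SPLITS = [                                      # (upper bound, table), invented
    (datetime(2015, 1, 1, tzinfo=timezone.utc), "G164"),
    (datetime(2016, 6, 1, tzinfo=timezone.utc), "G165"),
    (datetime(2100, 1, 1, tzinfo=timezone.utc), "G166"),
]

def table_for_guid(guid: str) -> str:
    t = guid_time(guid)
    return next(table for bound, table in SPLITS if t < bound)

print(table_for_guid("76CEBD12-76A0-11E6-9717-0D38A10ABEEF"))   # GUID from the example file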

Physical file pointers

Associated with GUIDs (not a separate namespace)

3B entries

Some URLs point to ZIP archive members

guid:///35f5bf88-78cc-11e6-a8c5-936566c57e15?ZIP=AliAOD.root

920M physical files

root://eosalice.cern.ch:1094//05/42395/35f5bf88-78cc-11e6-a8c5-936566c57e15
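
A small sketch of how a client could decompose such a ZIP-member pointer into the archive identifier and the member name before fetching the archive; the helper name is hypothetical.

# Hedged sketch: split an archive-member URL like the one above into the
# archive GUID and the member file name.
from urllib.parse import urlparse, parse_qs

def split_zip_member(url: str):
    parsed = urlparse(url)
    member = parse_qs(parsed.query).get("ZIP", [None])[0]
    return parsed.path.lstrip("/"), member     # (archive GUID, member inside it)

print(split_zip_member("guid:///35f5bf88-78cc-11e6-a8c5-936566c57e15?ZIP=AliAOD.root"))
# -> ('35f5bf88-78cc-11e6-a8c5-936566c57e15', 'AliAOD.root')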

DB query rates

1-year averages:

11500 Hz Selects

570 Hz Changes

260 Hz Deletes

80000 Hz sustained query rate peaks

71500 running jobs

No. of catalogue entries

[Plot: number of catalogue entries over time; 2x more files in one year]

AAA

[Diagram: data access flow. The client asks the Central Services (Catalogue DB: LFN -> URLs, permissions, metadata, ...) for /alice/some/file; the Central Services, which hold the private VO key, reply with an access token and an ordered list of replica URLs ("try first root:///…./guid"); the client presents the token to the disk or tape SE ("give me root:///.../guid"), which validates it with the public VO key; user DNs are taken from LDAP]

Current token structure

Encrypted XML with file details

-----BEGIN ENVELOPE-----

CREATOR: AuthenX

...

-----BEGIN ENVELOPE BODY-----

<authz>

<file>

<access>read</access>

<turl>root://t2-xrdrd.lnl.infn.it:1094//08/41595/f10139b0-c9ae-11e5-a508-2c44fd849358</turl>

<lfn>/NOLFN</lfn>

<size>479</size>

<guid>F10139B0-C9AE-11E5-A508-2C44FD849358</guid>

<md5>3a0ff51c8b46f853e22f084f4e50a0e2</md5>

<pfn>/08/41595/f10139b0-c9ae-11e5-a508-2c44fd849358</pfn>

<se>ALICE::LEGNARO::SE</se>

</file>

</authz>

-----END ENVELOPE BODY-----

-----END ENVELOPE-----

?authz=-----BEGIN SEALED CIPHER-----

E0DUCmnLX+r+6lkLkqhO813OQ4DvS0icIdNfh+9zPoOEwqArPihVmCxiqHp0bGRqhcJRCRcUk+qB

o01gOGKupUunvx8TX6yB8M5-sCXWWv6cxVLXNERQyjWM2TlzBMLG3xh09OnqnEf1YivjU3ojKH0H

wRC7MnajD2wlR7EGp60=

-----END SEALED CIPHER-----

-----BEGIN SEALED ENVELOPE-----

AAAAgE+WGqstnQzpFaM4SZNTn-9+NLoj4w-Gk2TI7WLluUSZG3l66JR1l331xIaNon0wYXTp2LZN

lTsufeOaH9y-HbGKGC9GhWVkciFY35D5FpaZzrOCSyc3PLpjRqARcDqLh4Qh-nJBnZv4lX5jaeZJ

v4JqglO9caQPOCXo0itrQ3+V2E1AA+hQjch5VcbbnDzdxwbl2it1HmOWXMrNFA4PuXgq8iE-QU+V

DMqoszO8KS8+FPhIeFCsy49MpYS+yL+mPw-fyuJR6DxITC0RBbnxGWRDUYEFnkinKBv4ikWpK-zV

2D6Oz1nHvJYEAD6VM-ovdZmuBcpbcYYTobeFYfftet4Db10B19xCeF2-qBJ7y5L1GvYux+RpJ+1p

rwxihNPS6aRV4d1WzPonngEeXQEks-vTgWFzRlOLIik0D6J4PHZ5wNRDouMvXsnSrlTwQS+ZXh6K

mmDURg4+AR9kEDdMyVPfUG+A2epZTLnJ9MNXeyXfKR8jYdtxwJyLkRGDe7EXJCB3gr7CfJFhIKGN

BNDmPD82P1BKiF-2e2Bq7As9u4a7Mu+cKftq2jBYqYIh4l1nWHCxCJ5fCAMHDke-qb1Js1zhP0Db

cVPgLyrpgqrkDkvNUuIQ2QxDS39GuM9-KmNvPpP5G+1D+O0W8Qkk2vvLBGZKrJYStmhDYejyW-UH

IumcykgH3lZr43J0iBs6UAhUbSFV3W5xTCqf1UD-ApVq4fbTFtaI43LQ+dqsuF1BaaKP3w4rhbWQ

0b++V0Z36KMYkGA4FP8tnbmvRx5vmOj3jAyOfVRQH6hSO21N9RxVqjL5A6mlzpjpobs068EKb-rO

NicJYEcvLB7w-Zo5A3iaikn-+eHlnt2ebKblgspKrJtSts0mbAh8s6tow3E5Nig+YUhdXHg5PY3U

SG1tHmsNoPo0GOUFRe3wtBeuUtc8J0xUTUJuSniRPq-VcF2gqEPoRDytLM74tmVA2RReekvghH3O

3g1cKHo-wz5geeTNSDvayHwfbwDcyj6wS3yViXxoWmzUeFwvyyB5

-----END SEALED ENVELOPE-----

Signed URLs

Planning to replace current tokens with signed URLs


?turl=root://t2-xrdrd.lnl.infn.it:1094//08/41595/f10139b0-c9ae-11e5-a508-2c44fd849358&access=read&lfn=/NOLFN&guid=f10139b0-c9ae-11e5-a508-2c44fd849358&size=479&md5=3a0ff51c8b46f853e22f084f4e50a0e2&se=ALICE::LEGNARO::SE&hashord=turl-access-lfn-guid-size-md5-se-hashord-issuer-issued-expires&issuer=jsh_aliendb4&issued=1476363783&expires=1476450183&signature=h3BQgXm53Yy7TWq1HHUUthyEPyRW3PpqfgUl86ubtcvm9rUVEpgdzHjt2i2jdditP6M9jiDu3PPzmLBoe7Cv8pZEao4/37P+N1WXWs1Vem76nZSlemlT95QGzKaGDJdfSTpMoGwM2HYYL0e8IN0ZPnPuWcklxaPRybCyKoXzc+U=

Clients could then freely reuse the file details (for example, to verify downloaded files)

ALICE Xrootd token auth plugin has an initial implementation of this scheme
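
A minimal sketch of the check an SE-side plugin could perform on such a signed URL. The layout of the string that is signed and the expiry handling are assumptions; only the field names and the hashord ordering come from the example URL above.

# Hedged sketch: validate a signed URL like the one above. The "&"-joined
# key=value string-to-sign and the signature algorithm are assumptions; the
# field names and the hashord ordering come from the example URL.
import time

def check_signed_query(query: str) -> bytes:
    # Split manually so base64 "+" characters in the signature stay intact.
    fields = dict(pair.split("=", 1) for pair in query.split("&"))
    if int(fields["expires"]) < time.time():
        raise PermissionError("signed URL has expired")
    order = fields["hashord"].split("-")        # field order covered by the signature
    to_sign = "&".join(f"{name}={fields[name]}" for name in order)
    # Verifying fields["signature"] over to_sign with the public VO key (e.g.
    # an RSA verify) would follow here; the algorithm is not given on the slide.
    return to_sign.encode()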

Run 3

200K CPU cores expected to be available by the start of Run 3

Storage at the start of Run 3

Similar growth rate for storage capacity

Trade disks for CPUs during LS2

Expected to be filled by the start of Run 3

Complexity management

We need to (logically) transform O(100) individual sites into O(10) clouds/regions

Each cloud/region should provide reliable data management and sufficient processing capability, to simplify scheduling and high-level data management

New O2 facility

463 FPGAs

Detector readout and fast cluster finder

100K CPU cores

To compress 1.1TB/s data stream 14x

5000 GPUs

Reconstruction speed-up

3 CPU + 1 GPU == 28 CPU

60PB disk space

Buffer space to allow for a more precise calibration

The current Grid and more in a single computing center
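
A quick back-of-the-envelope reading of the numbers above, taking the stated compression factor and the 3 CPU + 1 GPU == 28 CPU equivalence at face value:

# Hedged back-of-the-envelope reading of the O2 numbers above.
compressed_gb_s = 1.1 * 1000 / 14        # 1.1 TB/s compressed 14x -> ~79 GB/s to storage
gpu_as_cpus = (28 - 3) * 5000            # "3 CPU + 1 GPU == 28 CPU" -> 5000 GPUs ~ 125k CPU cores
print(f"~{compressed_gb_s:.0f} GB/s after compression, "
      f"5000 GPUs ~ {gpu_as_cpus:,} CPU-core equivalents")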

Analysis facilities

O2/T0/T1 reserved for reconstruction and calibration activities

T2s and HPC facilities will only run Monte Carlo jobs

Analysis input stored directly on the AF-attached storage

5-10PB of very well connected storage

20-30K CPU cores

Optimized for high IO throughput

Projected storage growth

How do we get there?

10x more data / files / queries

350PB / 30B files / 1MHz queries

EOS for regions/clouds?

EOS namespace for the catalogue?

Cassandra as catalogue backend?

Intel 3D XPoint for namespaces?

There is still time, but we should have a viable prototype in ~2 years

More immediate wish list

TPC supported in both pull and push modes

Main use case: RAW data export from T0 (Castor) to T1s (Xrootd / old Castor / dCache)

Native Java Xrootd client?

Would definitely prefer to avoid calling xrdcp

HTTP to Xroot gateway?

JSROOT is really cool!

Summary

ALICE relies entirely on xrootd as the protocol for data access and transfers

All current (HEP) storage solutions offer the protocol

We do not foresee any significant change to our use of xrootd in the future, including Run 3

We do however have a short development wish-list, which we believe will make xrootd more versatile

Summary (2)

Beyond the protocol, we foresee the need to simplify the current “one site, one SE” deployment scheme through regional storage federations

Achieving the current level of efficiency will require re-thinking and modifying both the site operation models and the deployment/control of the storage elements

Work on the above has begun...
