AGC at US CMS Analysis Facilities
Carl Lundstedt
University of Nebraska, Lincoln
AGC Workshop, May 4, 2023
Analysis Facilities
Coffea-casa @ Nebraska
Oksana Shadura, John Thiltges, Garhan Attebury,
Carl Lundstedt, Ken Bloom, Sam Albin, Brian Bockelman
Casa Hardware – Flatiron
– 12 Dell R750 servers: 512 GB RAM, 10× 3.2 TiB NVMe drives, Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz (56 threads/CPU, 2 CPUs per node)
– 2 x 100Gbps Networking, Calico + BGP
– Running AlmaLinux 8.6 (Sky Tiger)
– Ceph-Rook Filesystem @ 183 TiB
– Single V100 Nvidia GPU
– Ceph + Skyhook @ 8.7 TiB Usable
– Kubernetes (v1.26.2)
– Cert-manager, Dex, External-dns, Sealed-secrets, Traefik, CVMFS
Casa Infrastructure & Management
– Configs for casa are kept in Git
– Changes follow GitOps techniques
– Changes are applied in situ via a Flux agent
[Architecture diagram: the user's browser talks to a shared JupyterHub; each user gets their own Jupyter kernel, Dask scheduler, and Dask workers. Shared Kubernetes resources include the HTCondor scheduler and workers, data delivery services (ServiceX), XCache, Skyhook, and remote data access on grid/cluster site resources. Arrows distinguish data flow from non-data communication.]
Coffea-Casa
Building blocks: Authentication Tools
JupyterHub allows for a variety of authentication methods, and we inherit this functionality. Using OAuth we can select an OIDC service to manage users for us. Dummy authentication is also useful for spinning up test instances. Each instance must be registered, and the secrets for that client have to be available in the instance. We seal these secrets so we can store them encrypted in Git. (A configuration sketch follows the registration links below.)
https://cms-auth.web.cern.ch
https://cilogon.org/oauth2/register
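As a rough illustration of what this looks like on the JupyterHub side, here is a minimal sketch of a jupyterhub_config.py using the GenericOAuthenticator from the oauthenticator package against an OIDC provider such as the CMS IAM instance linked above. The client ID/secret, callback URL, endpoint paths, and claim name are placeholders; the real values come from the registered client and, in our setup, are injected from sealed secrets. This is not the actual coffea-casa configuration.

```python
# jupyterhub_config.py -- minimal OIDC login sketch (placeholders, not production config).
# The `c` object is provided by JupyterHub's config loader when this file is read.
import os
from oauthenticator.generic import GenericOAuthenticator

c.JupyterHub.authenticator_class = GenericOAuthenticator

# Client credentials are assumed to arrive as environment variables populated
# from the sealed secrets stored in Git.
c.GenericOAuthenticator.client_id = os.environ["OAUTH_CLIENT_ID"]
c.GenericOAuthenticator.client_secret = os.environ["OAUTH_CLIENT_SECRET"]

c.GenericOAuthenticator.oauth_callback_url = "https://coffea.casa/hub/oauth_callback"  # placeholder
c.GenericOAuthenticator.authorize_url = "https://cms-auth.web.cern.ch/authorize"       # assumed path
c.GenericOAuthenticator.token_url = "https://cms-auth.web.cern.ch/token"               # assumed path
c.GenericOAuthenticator.userdata_url = "https://cms-auth.web.cern.ch/userinfo"         # assumed path
c.GenericOAuthenticator.scope = ["openid", "profile", "email"]
c.GenericOAuthenticator.username_claim = "email"  # whichever claim identifies the user
```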
The Four Casa Instances
– CMS-Prod (https://coffea.casa)
– Opendata-Prod (https://coffea-opendata.casa)
– CMS-Dev
– Opendata-Dev
Workflow Scale Out
Scale-out is accomplished with a custom Dask-Jobqueue class that deploys Dask workers either on our T2 resources or on an HTCondor cluster running inside the Flatiron Kubernetes cluster.
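The custom class is coffea-casa specific, but the pattern is the standard Dask-Jobqueue one. For illustration, the sketch below uses the upstream HTCondorCluster as a stand-in; the resource requests and job counts are placeholders, not our actual configuration.

```python
# Rough sketch of HTCondor scale-out with upstream dask_jobqueue.
# The facility wraps this pattern in its own Dask-Jobqueue subclass that knows
# how to land workers on the T2 or on the in-cluster condor pool.
from dask_jobqueue import HTCondorCluster
from dask.distributed import Client

cluster = HTCondorCluster(
    cores=4,          # cores per worker job (placeholder)
    memory="8 GB",    # memory per worker job (placeholder)
    disk="4 GB",      # scratch disk per worker job (placeholder)
)
cluster.scale(jobs=20)   # ask the condor pool for 20 worker jobs

client = Client(cluster)  # Dask computations now fan out to the condor workers
print(client)
```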
Storage & Data Access
– Each user is given 10 GB of persistent storage on login.
– XCache access via tokens issued at login (see the sketch after this list).
– cern.ch CVMFS mounted in the pods.
– The user's T2 /store/user area is mounted in the user pod.
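To give a sense of what the XCache piece buys the user: inside a session, remote CMS files can be opened through the cache with plain uproot, with the login token handled by the environment. The hostname and file path below are placeholders, not the actual coffea-casa endpoints.

```python
# Sketch of reading a remote file through an XCache instance with uproot
# (requires the XRootD Python bindings for root:// URLs).
# "xcache.example.org" and the /store path are placeholders; on the facility the
# correct cache endpoint and the token issued at login are already set up in the
# user's environment.
import uproot

url = "root://xcache.example.org//store/mc/placeholder/nanoaod_file.root"  # placeholder
with uproot.open(url) as f:
    events = f["Events"]                            # NanoAOD TTree
    pt = events["Muon_pt"].array(entry_stop=1000)   # read one branch for a quick look
    print(pt[:5])
```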
Triton Inference Service
– To leverage the presence of our V100 GPU, an inference service is deployed in the CMS-Dev instance (a client sketch follows the bucket URL below).
– Training sets can be stored in an S3 bucket deployed just for this task.
s3://rook-ceph-rgw-my-store.rook-ceph.svc:80/triton-c9adf042-ffb8-4221-bd42-e385efb1d0e2
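For reference, querying such an inference service from a notebook looks roughly like the following sketch using the tritonclient package; the server address, model name, and tensor names are made-up placeholders for illustration only.

```python
# Sketch of querying a Triton inference service from a user notebook.
# Server address, model name, and input/output tensor names are placeholders.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton.example.svc:8001")  # placeholder

batch = np.random.rand(16, 20).astype(np.float32)          # dummy feature vectors
inp = grpcclient.InferInput("INPUT__0", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)
out = grpcclient.InferRequestedOutput("OUTPUT__0")

result = client.infer(model_name="my_model", inputs=[inp], outputs=[out])
scores = result.as_numpy("OUTPUT__0")
print(scores.shape)
```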
ServiceX
FNAL's Elastic Analysis Facility
Burt Holzman – Project lead
Maria Acosta – Technical lead for applications
Chris Bonnaud – Technical lead for infrastructure
Joe Boyd, Glenn Cooper, Lindsey Gray,
Farrukh Khan, Ed Simmonds, Nick Smith, Elise Chavez
Onsite Login
Login: Multi-VO Support
Application Ecosystem
Triton
They've built a multi-VO, secure, integrated Elastic Analysis Facility prototype in compliance with DOE/Lab cybersecurity requirements.
Started as a USCMS project, it has grown into a multi-experiment initiative providing services to our experiments and scientists.
• Developed more than 20 environments for experiments with dedicated CVMFS mounts, shared storage and specific scientific software.
• Collaborating with multiple groups across the laboratory as well as industry partners, open source projects and other institutions has allowed us to gain insights on what our users _really_ want and need.
• Strengthened participation with IRIS-HEP and the USCMS collaboration on building next generation analysis facilities in the US
FNAL's Elastic Analysis Facility
MIT's Analysis Facility
Mariarosaria D'Alfonso, Josh Bendavid, Chad Freer, Zhangqier Wang, Luca Lavezzo, Christoph Paus
subMIT
An MIT Physics Department analysis facility.
→ provides ecosystems for many research areas
subMIT system provides an interactive login pool + scale-out to batch resources
subMIT
CMS connection to subMIT
[Diagram: submit.mit.edu is a virtual center ('subMIT') connected to all resources on campus through three frontends. The Campus frontend reaches the MIT campus factory (HPRCF, Bates, the EAPS Earth and Planetary Sciences cluster, CTP, MKI, …), including the CMS Tier-2 and Tier-3 (T2_US_MIT, T3_US_MIT), with a mix of normal-priority and preemptible slots. The CMS frontend reaches US CMS T2 sites and CMS computing resources across the world (restricted to CMS members). The OSG frontend reaches Open Science Grid resources, with plenty of capacity at various universities and laboratories across the US.]
Examples of workflows on subMIT from LHC/CMS
Very different analysis requirements.
Common feature: all use the NanoAOD simplified data format as input.
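As a concrete illustration of that common input format, a minimal read of a NanoAOD file with coffea's NanoEvents looks like the sketch below (coffea 0.7-style API assumed; the file path is a placeholder).

```python
# Minimal sketch of opening a NanoAOD file with coffea's NanoEvents
# (coffea 0.7.x-style API; the file path is a placeholder).
import awkward as ak
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

events = NanoEventsFactory.from_root(
    "nanoaod_file.root",            # placeholder path (local file or root:// URL)
    schemaclass=NanoAODSchema,
).events()

# NanoAOD branches become structured, jagged collections:
good_muons = events.Muon[events.Muon.pt > 25]
print(len(events), "events read;", int(ak.sum(ak.num(good_muons))), "selected muons")
```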
Deployed Features by Analysis Facility
| Facility | Dask support | Batch | XCache | ServiceX | JupyterHub interface | MLflow | Triton | GPU support |
| EAF @ FNAL | Dask Gateway | HTCondor | x | x | x | - | x | x |
| Coffea-casa @ UNL | Via HTCondor | HTCondor | x | x | x | x | x | 1 GPU |
| subMIT @ MIT | Via Slurm | Slurm | ? | - | x | - | ? | 4 GPUs per node |
| | WIP | WIP | ? | - | x | - | x | 2 GPUs |
AGC at US CMS Analysis Facilities
Questions?