1 of 5

HSF Analysis Ecosystem Workshop

Missing Pieces


Pete Elmer, Fons Rademakers, Graeme Stewart


2 of 5

What’s a missing piece?

This is a discussion session to synthesize what has come before and to address what was missed in the coverage: where are the weak spots in the ecosystem, where are there opportunities for new work and new projects, and what is the priority work to do? We can also return to interesting questions and discussion points accumulated during the workshop. This session will blend into the final session on conclusions and outcomes.

Topics for discussion:

  • Addressing questions and discussions deferred to this session
  • Strengthening and growing the analysis ecosystem
  • Where are the gaps and opportunities?
  • What is the priority work going forward?
  • Prospective/planned funding proposals, new/extended projects
  • How do we grow the effort and where do we need new effort the most?


3 of 5

A (typical?) HEP Analysis in 2027…

  • Primary analysis inputs are made centrally in the experiment’s production system from Run 4 LHC data
  • I send a high-level description of a selection in [MFL]* to the grid to make a reduced sample for my sub-group (a sketch of what this might look like closes this slide)
    • Catalogs for data discovery are used, but specifics are irrelevant to me
    • Task bookkeeping ensures I know what events went into my skim and how much data was run over
    • Data gets stored in my cloud storage area
  • Describing my analysis workflow in declarative [MFL], I iterate over different selection criteria
    • Using some reduced sample to pre-test
    • This could be happening on my 128-core laptop with 1 TB of XPoint memory, or via some notebook backed by a cluster


*My Favourite Language
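
A minimal sketch, assuming the declarative [MFL] request can be written in Python, of the skim description sent to the grid. Everything here (the dataset name, the variable list, submit_to_grid) is invented for illustration; the point is that the analyst states what to select, not how to run it.

# Hypothetical declarative skim request; all names are invented.
skim = {
    # Resolved through the experiment's data-discovery catalog; the
    # analyst never deals with file locations or replicas.
    "input": "mc_run4.ttbar.DAOD_PHYS",
    # Event-level selection, stated declaratively.
    "select": "n_jets >= 4 and n_bjets >= 2 and lep_pt > 25.0",
    # Only the variables the sub-group needs are kept in the skim.
    "keep": ["jet_pt", "jet_eta", "bjet_score", "lep_pt", "met"],
    # Output lands in the analyst's cloud storage area.
    "output": "cloud://mygroup/skims/ttbar_4j2b_v1",
}

# submit_to_grid is a stand-in for whatever submission API the experiment
# provides; task bookkeeping (events read, data volume, provenance) is
# recorded automatically by the production system.
# task = submit_to_grid(skim)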

4 of 5

A (typical?) HEP Analysis in 2027…

  • I realise my event selector needs better training
    • I skim off a selection of variables from my Monte Carlo in ROOT format
      • There are other formats, but I like this one as it’s storage efficient
    • These are loaded up and connected to Keras++ via numpy arrays (one choice of many; see the sketch after this list)
      • After training on a backend that I never even need to identify, I have a trained DNN selector
  • I rerun my selection (updated git tag!) with the new selector, which ROOT has loaded from its native format
  • Calibrations and enhanced object selections are done here
    • Luminosity and other bookkeeping is automated
  • Outputs are now small enough to download to a local machine
    • I still do my plots in ROOT, because it’s the best for HEP data
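
A sketch of the training step with today’s tools (uproot, numpy, Keras) standing in for the slide’s hypothetical Keras++; the file, tree, and branch names are invented for illustration.

import numpy as np
import uproot
from tensorflow import keras

# Read the skimmed Monte Carlo variables from ROOT format into numpy arrays.
tree = uproot.open("mc_skim.root")["events"]
data = tree.arrays(["jet_pt", "jet_eta", "bjet_score", "met", "is_signal"],
                   library="np")

X = np.stack([data["jet_pt"], data["jet_eta"],
              data["bjet_score"], data["met"]], axis=1)
y = data["is_signal"]

# A small DNN event selector; which backend runs underneath is an
# implementation detail the analyst never has to look at.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=10, batch_size=256, validation_split=0.2)

# Persist the trained selector so the re-run selection step can pick it up;
# the slide imagines ROOT loading it from a native format.
model.save("selector.keras")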


5 of 5

Key Elements

  • Easy access to bookkeeping information and other metadata
    • Common support for this across experiment boundaries
  • Declarative language (today we like Python) used to describe tasks
    • Describe what, not how
  • Analysis scales easily and transparently to different resources
    • Executor will marshal different components as needed
    • Integration with cloud/grid resources and special resources
    • The workflow engine is very robust - the user does not need to worry about the ~1% of tasks that fail (a toy retry sketch follows this list)
  • Easy integration with other tools
    • Both in persistent formats and transient representations (like numpy and the zoo Jim showed)
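
A toy sketch of the retry behaviour such a robust engine would hide from the analyst. run_task and TransientError are hypothetical stand-ins for the engine’s internals; a real engine would also reschedule to other workers, blacklist bad sites, and so on.

import random
import time

class TransientError(Exception):
    """A recoverable failure (lost worker, storage hiccup, ...)."""

def run_task(task_id: int) -> str:
    # Stand-in for one unit of the workflow; fails ~1% of the time
    # to mimic grid/cloud reality.
    if random.random() < 0.01:
        raise TransientError(f"task {task_id} lost its worker")
    return f"task {task_id} done"

def run_with_retries(task_id: int, max_attempts: int = 5) -> str:
    # The executor retries transparently with exponential backoff;
    # the analyst only ever sees the final result.
    for attempt in range(1, max_attempts + 1):
        try:
            return run_task(task_id)
        except TransientError:
            time.sleep(0.01 * 2 ** attempt)
    raise RuntimeError(f"task {task_id} failed {max_attempts} times")

results = [run_with_retries(i) for i in range(1000)]
print(len(results), "tasks completed without user intervention")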

This is a significant evolution of the current model, but the model must be kept flexible, because:

  • Not everything can fit into one model anyway
  • It should be possible (easy?) to make radical departures from the norm in one or more pieces - there may be a revolution coming and we just don’t know it yet
