1 of 5

HSF Analysis Ecosystem Workshop

Missing Pieces


Pete Elmer, Fons Rademakers, Graeme Stewart


2 of 5

What’s a missing piece?

This is a discussion session to synthesize what has come before and to address what was missed in the coverage: where are the weak spots in the ecosystem, where are there opportunities for new work and new projects, and what is the priority work to do? We can also return to interesting questions and discussion points accumulated during the workshop. This session will blend into the final session on conclusions and outcomes.

Topics for discussion:

  • Addressing questions and discussions deferred to this session
  • Strengthening and growing the analysis ecosystem
  • Where are the gaps and opportunities?
  • What is the priority work going forward?
  • Prospective/planned funding proposals, new/extended projects
  • How do we grow the effort and where do we need new effort the most?


3 of 5

A (typical?) HEP Analysis in 2027…

  • Primary analysis inputs are made centrally in the experiment’s production system from Run 4 LHC data
  • I send a high-level description of a selection in [MFL]* to the grid to make a reduced sample for my sub-group (a sketch of what this might look like closes this slide)
    • Catalogs for data discovery are used, but specifics are irrelevant to me
    • Task bookkeeping ensures I know what events went into my skim and how much data was run over
    • Data gets stored in my cloud storage area
  • Describing my analysis workflow in declarative [MFL], I iterate over different selection criteria
    • Using some reduced sample to pre-test
    • This could be happening on my 128-core laptop with 1 TB of XPoint memory, or via some notebook backed by a cluster


*My Favourite Language
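
A minimal sketch, assuming the declarative [MFL] request can be written in Python, of the skim description sent to the grid. Everything here (the dataset name, the variable list, submit_to_grid) is invented for illustration; the point is that the analyst states what to select, not how to run it.

# Hypothetical declarative skim request; all names are invented.
skim = {
    # Resolved through the experiment's data-discovery catalog; the
    # analyst never deals with file locations or replicas.
    "input": "mc_run4.ttbar.DAOD_PHYS",
    # Event-level selection, stated declaratively.
    "select": "n_jets >= 4 and n_bjets >= 2 and lep_pt > 25.0",
    # Only the variables the sub-group needs are kept in the skim.
    "keep": ["jet_pt", "jet_eta", "bjet_score", "lep_pt", "met"],
    # Output lands in the analyst's cloud storage area.
    "output": "cloud://mygroup/skims/ttbar_4j2b_v1",
}

# submit_to_grid is a stand-in for whatever submission API the experiment
# provides; task bookkeeping (events read, data volume, provenance) is
# recorded automatically by the production system.
# task = submit_to_grid(skim)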

4 of 5

A (typical?) HEP Analysis in 2027…

  • I realise my event selector needs better training
    • I skim off a selection of variables from my Monte Carlo in ROOT format
      • There are other formats, but I like this one as it’s storage efficient
    • These are loaded up and connected to Keras++ via numpy arrays (one choice of many; see the sketch after this list)
      • After training on a backend that I never even need to identify, I have a trained DNN selector
  • I rerun my selection (updated git tag!) with the new selector, which ROOT has loaded from its native format
  • Calibrations and enhanced object selections are done here
    • Luminosity and other bookkeeping is automated
  • Outputs are now small enough to download to a local machine
    • I still do my plots in ROOT, because it’s the best for HEP data
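
A sketch of the training step with today’s tools (uproot, numpy, Keras) standing in for the slide’s hypothetical Keras++; the file, tree, and branch names are invented for illustration.

import numpy as np
import uproot
from tensorflow import keras

# Read the skimmed Monte Carlo variables from ROOT format into numpy arrays.
tree = uproot.open("mc_skim.root")["events"]
data = tree.arrays(["jet_pt", "jet_eta", "bjet_score", "met", "is_signal"],
                   library="np")

X = np.stack([data["jet_pt"], data["jet_eta"],
              data["bjet_score"], data["met"]], axis=1)
y = data["is_signal"]

# A small DNN event selector; which backend runs underneath is an
# implementation detail the analyst never has to look at.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=10, batch_size=256, validation_split=0.2)

# Persist the trained selector so the re-run selection step can pick it up;
# the slide imagines ROOT loading it from a native format.
model.save("selector.keras")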


5 of 5

Key Elements

  • Easy access to bookkeeping information and other metadata
    • Common support for this across experiment boundaries
  • Declarative language (today we like Python) used to describe tasks
    • Describe what, not how
  • Analysis scales easily and transparently to different resources
    • Executor will marshal different components as needed
    • Integration with cloud/grid resources and special resources
    • The workflow engine is very robust - the user does not need to worry about the ~1% of tasks that fail (a toy retry sketch follows this list)
  • Easy integration with other tools
    • Both in persistent formats and transient representations (like numpy and the zoo Jim showed)
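
A toy sketch of the retry behaviour such a robust engine would hide from the analyst. run_task and TransientError are hypothetical stand-ins for the engine’s internals; a real engine would also reschedule to other workers, blacklist bad sites, and so on.

import random
import time

class TransientError(Exception):
    """A recoverable failure (lost worker, storage hiccup, ...)."""

def run_task(task_id: int) -> str:
    # Stand-in for one unit of the workflow; fails ~1% of the time
    # to mimic grid/cloud reality.
    if random.random() < 0.01:
        raise TransientError(f"task {task_id} lost its worker")
    return f"task {task_id} done"

def run_with_retries(task_id: int, max_attempts: int = 5) -> str:
    # The executor retries transparently with exponential backoff;
    # the analyst only ever sees the final result.
    for attempt in range(1, max_attempts + 1):
        try:
            return run_task(task_id)
        except TransientError:
            time.sleep(0.01 * 2 ** attempt)
    raise RuntimeError(f"task {task_id} failed {max_attempts} times")

results = [run_with_retries(i) for i in range(1000)]
print(len(results), "tasks completed without user intervention")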

This is a significant evolution of the current model, but the model must be kept flexible, because:

  • Not everything can fit into one model anyway
  • It should be possible (easy?) to make radical departures from the norm in one or more pieces - there may be a revolution coming and we just don’t know it yet
