1 of 42

Response to Questions/Requests for Analysis Systems

Kyle Cranmer (NYU)

2 of 42

Analysis Systems Team


Institutions: NYU, Washington, Princeton, Cincinnati, Illinois

3 of 42

Summarize the set of projects/activities and associated effort for your area

See next two pages


4 of 42

Projects

  • Analysis systems are connected to analysis use cases
  • Systems are composed of components
  • Most of these projects refer to those components
    • many projects include people beyond IRIS-HEP
  • Milestones and activities are mainly oriented towards integration and evaluation, with a global overview of the vertical slice


5 of 42

Summarize the set of projects/activities and associated effort for your area

Awkward, uproot:

  • Jim Pivarski and Ianna Osborne (funded largely by DIANA-HEP until ~now)

func_adl & ServiceX:

  • Gordon Watts + Mason Proffitt @ UW, in collaboration with Ben Galewsky & Marc Weinberg

pyhf: Matthew Feickert (~80%)

MadMiner & Exploratory ML: Johann Brehmer (~50%)

Integrating tools & developing declarative specifications for end-users:

  • Alexander Held (~80%)

Modernizing analysis support (e.g. Docker, analysis preservation, training, conda-forge, etc.)

  • Matthew Feickert (~20%), Alexander Held (~20%), Henry Schreiner (~25%?)

Cyberinfrastructure: Sinclert Perez (~30% starting 2020, HEPData and RECAST)

Core Scikit-HEP: boost-histogram, hist, vector, … Henry Schreiner (~65%)

Misc. Scikit-HEP: scikit-hep org, particle, DecayLanguage … Henry Schreiner (10%), Daniel Vieira (???, not participating in biweekly meetings)


It should be easier for area leads to see the FTEs

ServiceX activities overlap with DOMA/SSL; not sure how to handle the accounting for the purposes of this meeting.

6 of 42

Are there internal or external collaborations associated with each project or activity? For external collaborations, is IRIS-HEP leading, contributing or simply “connecting/liaising”?

Internal:

  • SSL: benchmarking and scaling, REANA testbeds, etc.
  • SSL & DOMA: ServiceX

External:

  • DIANA-HEP: last bits of funding on the no-cost extension (NCE) supporting various well-aligned items
  • SCAILFIN: developing products, good synergy w/ IRIS-HEP and the REANA dev team
  • INSPIRE-HEP, HEPData, CAP, Invenio: advisory boards, joint development
  • ATLAS stats effort: Docker containers for RooFit-based statistical analysis & combinations and development of pyhf tools. IRIS-HEP (Matthew, Kyle, Alex) & Lukas & Giordon are leading
  • HEP Statistics Serialization Standard (HS3): similar cast of characters
  • scikit-hep: useful umbrella (not seen as US, ATLAS/CMS, or HSF); IRIS-HEP leading by example
    • Awkward:
      • formal collaboration with Amy Roberts at UC Denver on Kaitai Structs
      • frequent collaboration with LPC/Coffea (Lindsey Gray)
      • close liaisons with Anaconda.com: Numba and Dask developers
      • intermittent contact with Oxford Big Data Institute (genetics, developers of Zarr)


7 of 42

Which project/activities/goals are making progress and which are not? (Area lead’s opinion) For those that are not, what is impeding progress?

Projects making good progress with good adoption:

  • uproot and awkward usage in physics and by developers
  • ServiceX & func_adl code development is active and the groups seem to coordinate well
  • pyhf adoption in physics analysis (example: ATLAS SUSY EWK group using it for official combinations)
  • REANA growing in popularity and planning to deploy at facilities
  • MadMiner interest is growing: ~100-person tutorial full of ATLAS & CMS people trying it out
  • Scikit-HEP community growing rapidly, interest from experiments’ statistics committees & physics groups

Some issues / concerns

  • ServiceX for CMS had some bumps, but that is being addressed. Large effort should have big win.
  • LHCb effort in GooFit/AmpGen/DecayLanguage not well integrated (though conceptually good fit)
  • Role of “Analysis Facility” not part of discussions in Analysis Systems group, not well integrated.

Basically completed

  • Awkward (1.0? clarify)
  • Uproot 3.x (future is Uproot 4.x)
  • PPX protocol
  • Boost histogram (future is hist)
  • MadMiner as a tool (user community growing)


analysis facility blueprint needed

A gap in planning.

8 of 42

How are each of these projects/activities connected to, being informed by or planning on delivering (eventually) to the experiments? Are there relevant blueprint meetings or workshops that should happen to make progress?

  • IRIS-HEP stats tools making good progress in ATLAS
  • RECAST, pyhf, analysis preservation making good progress in ATLAS
    • Clemens Lange looking for RECAST contributors for CMS, added a Fellows project to encourage this.
  • MadMiner is having tires kicked by ATLAS & CMS experimentalists
    • closely related: reweighting based on CARL is an ATLAS qualification project
  • Coffea being picked up increasingly in CMS
    • corresponding effort needed for ATLAS and LHCb. IRIS-HEP fellow project added to encourage this
    • Alex Held and KyungEon working on a TRExFitter analog using these tools
  • A blueprint connected to AS grand challenge & analysis facility may help
  • ServiceX in R&D phase now, but need to check on planning for ServiceX for analysis facility: Skyhook, River, …
    • answer requested on SLACK ServiceX channel (next page)


KyungEon doesn’t seem to formally be part of IRIS-HEP. He should be. Maybe a fellow?

9 of 42

How are each of these projects/activities connected to, being informed by or planning on delivering (eventually) to the experiments? Are there relevant blueprint meetings or workshops that should happen to make progress?

ServiceX is designed to facilitate high-performance array-based analyses. It does this by allowing users to construct sophisticated in-place data queries via an analysis description language, performing on-the-fly data transformations into convenient analysis formats, and connecting the output to future analysis facilities.

  • It is closely connected to both ATLAS and CMS, and features the ability to handle experiment-specific input formats like xAOD (ATLAS) and miniAOD (CMS).
  • The developers are currently in close contact with members of each experiment to ensure the service will improve time-to-insight in both cases, and to make it possible to easily develop additional transformers for new input formats.
  • While ServiceX is currently in a prototyping phase, it seeks to reach production version 1.0 by late May, at which point the service will benefit significantly from formal blueprint meetings with both collaborations.
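To give a flavor of the query interface described above, here is a minimal func_adl sketch. The query operators (SelectMany, Where, Select) are func_adl’s own; the package entry point, dataset name, and jet collection name are assumptions for illustration, and the exact imports have varied across prototype releases.

```python
# A minimal sketch of a func_adl query against ServiceX; the entry point and
# dataset name are assumptions, the query operators are func_adl's.
from func_adl_servicex import ServiceXSourceXAOD  # assumed package/class name

ds = ServiceXSourceXAOD("mc16_13TeV.SomeDataset.DAOD_PHYS")  # hypothetical Rucio DID

# Declarative query: jet pTs above 30 GeV across all events.
jet_pts = (
    ds.SelectMany(lambda e: e.Jets("AntiKt4EMTopoJets"))
      .Where(lambda j: j.pt() / 1000.0 > 30.0)
      .Select(lambda j: j.pt() / 1000.0)
      .AsAwkwardArray(["JetPt"])
      .value()  # triggers the ServiceX transformation and fetches the result
)
```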


10 of 42

What would be potential Year 3 milestones for each of the projects? (First ideas, to be iterated with PIs and the whole team as this process moves forward.)

  • Integration of func_adl specification for variable definition and selections with the emerging specification for high-level TRExFitter-like analysis
  • Demonstration of a differentiable analysis pipeline (e.g. a new TRExFitter) ending with a pyhf limit, back-propagating through selections implemented with awkward, func_adl, etc. (see the sketch after this list)
    • connect to pyhf/neos demo. Need autodiff-able awkward
    • connect to histogramming projects
    • Discussion in Slack to connect this with the Sally algorithm in MadMiner
  • Documentation and training event using new tools
  • Use of new IRIS-HEP tools (MadMiner, awkward, …) for analysis in LHC experiment (may not be published by end of Y3)
  • Snowmass (tools & REANA workflows for sensitivity studies)
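To illustrate the differentiable-pipeline milestone, here is a minimal sketch in the spirit of pyhf + neos, using plain JAX: a hard selection cut is relaxed to a sigmoid so that a simple figure of merit becomes differentiable with respect to the cut position. Everything here (the toy data, the s/sqrt(b) objective) is illustrative, not the planned implementation.

```python
# An illustrative sketch of a differentiable analysis step (JAX only); the real
# milestone targets pyhf limits back-propagated through awkward/func_adl.
import jax
import jax.numpy as jnp

def soft_cut(x, threshold, slope=10.0):
    # Differentiable relaxation of the hard selection x > threshold.
    return jax.nn.sigmoid(slope * (x - threshold))

def objective(threshold, sig, bkg):
    s = jnp.sum(soft_cut(sig, threshold))  # expected signal after the soft cut
    b = jnp.sum(soft_cut(bkg, threshold))  # expected background after the soft cut
    return -s / jnp.sqrt(b + 1e-3)         # negative s/sqrt(b): minimize this

key_s, key_b = jax.random.split(jax.random.PRNGKey(0))
sig = 2.0 + jax.random.normal(key_s, (1_000,))  # toy signal peaked at 2
bkg = jax.random.normal(key_b, (10_000,))       # toy background at 0

grad = jax.grad(objective)
threshold = 0.0
for _ in range(200):  # gradient descent on the cut position
    threshold = threshold - 0.1 * grad(threshold, sig, bkg)
print(f"optimized cut: {threshold:.2f}")
```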


autodiff blueprint

(may need to send gradients back to analysis facility via ServiceX)

11 of 42

What “grand challenges” would be useful to organize involving your area during Year 3 of IRIS-HEP? How would these challenges depend on efforts from other areas of IRIS-HEP, the US LHC Ops programs or the experiments?

(Assuming not to be completed in Y3, but organized in Y3)

  • Ability to process XX TB of data in YY minutes using columnar analysis tools
  • End-to-end analysis optimization with automatic differentiation on large (~TB) simulated data sets with multiple signal and background components and systematic uncertainties. [touches on ServiceX]
  • Ability to test a new theory by reinterpreting multiple analyses and performing a statistical combination of their results in ~1 day (+ the time it takes to generate new signal MC) [touches on LHC Ops b/c would use production system]
    • extend by using excursion to streamline MC production for ATLAS or CMS reinterpretation campaign [touches on LHC Ops b/c would use production system]
  • Ability for new user to “fork” an analysis, make a modification, and obtain new results in an afternoon. [touches the experiments and training]


12 of 42

Are there new opportunities where effort from IRIS-HEP can make an impact? Is the alignment of the focus areas in IRIS-HEP appropriate?

  • Visualization tools (e.g. Altair-like declarative visualization; see the sketch after this list)
    • yes for AS, but expansion of scope
  • excursion alg. to streamline MC production for ATLAS or CMS reinterpretation campaign
    • yes for AS
  • Improve efficiency of event gen. with ML-inspired tools & techniques
    • yes for AS, but an expansion into “theory” tools
  • pyhf and astrophysics (HEALPix for boost histogram)
    • yes for AS, but secondary aim of IRIS-HEP
  • MadMiner like tools for EIC
    • yes for AS, but secondary aim of IRIS-HEP. Brought up at 18 mo review
  • Python library for FastJet that plays well with columnar analysis
  • Documentation efforts
    • yes, aligns with “lowering barriers” goal of AS
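A minimal Altair sketch of what “declarative visualization” means in the first bullet above (the histogram data is made up): you declare encodings, not drawing commands.

```python
# A minimal sketch of declarative visualization with Altair (toy data).
import altair as alt
import pandas as pd

df = pd.DataFrame({"mass": [110, 115, 120, 125, 130],
                   "events": [52, 48, 61, 95, 57]})

# Declare what to plot (encodings), not how to draw it.
chart = alt.Chart(df).mark_bar().encode(
    x=alt.X("mass:Q", title="m [GeV]"),
    y=alt.Y("events:Q", title="Events"),
)
chart.save("mass_hist.html")  # rendered via Vega-Lite
```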


13 of 42

How are projects currently managed in your area? What tools are being used? How is progress measured? How are risks recorded, identified and mitigated?


Progress tracked on GitHub: github.com/iris-hep/project-milestones/

14 of 42

Are the metrics being used to measure success clearly defined? How well do metrics in your area measure progress, success or impact? Where can the metrics be improved or refined to better measure progress, success or impact?

Metrics listed on next page for reference.

  • Metrics are clearly defined
    • I put some effort into defining reasonable targets for them
    • some targets are “relative” (fraction of specs that are implemented) and some are “absolute” (number of XXX)
    • Not clear if targets should be for the current time or end of Y5
  • They are OK at measuring progress / effort, though not necessarily great at measuring “success”
    • hard to assess meaningful measures of “success” early in the (R&D) phase
    • measures of user adoption & results using new tools will lag development by >1 yr
  • Maybe better to enumerate the AS components needed for a few vertical analysis use cases and track what fraction of those components are implemented (and connect this with performance benchmarking of individual components)


15 of 42

Metrics

M.2.1: Number of specifications developed

  • 12 thus far. Expect maybe 50 after 5 years

M.2.2: Number of implementations for corresponding specifications

  • 5/12: ppx, func_adl, pyhf, aghast, histos, decay language

M.2.3: Throughput and latency metrics for analysis systems using SSL testbed

  • Aiming for ~10x speedup for various analysis tasks. Seeing >100x in some cases

M.2.4: List of experiments using CAP and number of analyses stored in CAP

  • 14 ATLAS analyses with workflows in CAP/REANA/RECAST-ready format

M.2.5: Number of results / papers making use of CAP/REANA

  • 3 thus far, more on the way

M.2.6: GitHub stars, forks, watch, contributor statistics

  • 12 GitHub repos
  • healthy statistics for core projects


e.g. uproot & awkward
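These M.2.6 statistics can be pulled straight from the public GitHub REST API; a small sketch follows (the repository list is illustrative, not the official set of 12).

```python
# A sketch of collecting M.2.6-style statistics from the public GitHub REST API.
import requests

repos = ["scikit-hep/uproot", "scikit-hep/awkward-array", "scikit-hep/pyhf"]
for repo in repos:
    info = requests.get(f"https://api.github.com/repos/{repo}").json()
    # stars, forks, watchers ("subscribers" in the API)
    print(repo, info["stargazers_count"], info["forks_count"],
          info["subscribers_count"])
```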

16 of 42

Backup


17 of 42

Prior to IRIS-HEP

[Diagram: bulk data processing → reconstruction algorithms → analysis code]

Analysis code in HEP is often more free-form with less organized development:

  • one-off approach limits functionality
  • slow iteration cycle
  • slow on-boarding and lack of interoperability
  • difficult to reproduce and reuse
  • primarily ROOT & C++
  • lack of developer community
  • overlapping solutions
  • data redundancy


18 of 42

IRIS-HEP as an Institute

[Diagram: ad hoc analysis code evolving into Analysis Systems]

Analysis Systems strategies:

  • improve functionality & interoperability
  • more modular, less dependence on ROOT
  • declarative: focus on what to do, not how to do it
  • align with modern data science practices
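A minimal uproot + awkward sketch of the columnar, declarative style these strategies point to (the file and branch names are hypothetical; Uproot 4-style API):

```python
# A minimal sketch of columnar analysis with uproot + awkward
# ("events.root" and the branch name are hypothetical).
import awkward as ak
import uproot

tree = uproot.open("events.root")["Events"]
arrays = tree.arrays(["jet_pt"], library="ak")  # jagged: one list of jets per event

# One array expression replaces an explicit event loop.
selected = arrays["jet_pt"][arrays["jet_pt"] > 30.0]  # per-event selection
leading = ak.max(selected, axis=1)                    # leading selected jet pT per event
```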



19 of 42

Analysis Systems

  • Develop sustainable analysis tools to extend the physics reach of the HL-LHC experiments by
    • creating greater functionality to enable new techniques,
    • reducing time-to-insight,
    • lowering the barriers for smaller teams, and
    • streamlining analysis preservation, reproducibility, and reuse.


Analysis Systems projects span all stages of end-user analysis.

20 of 42

Scikit-HEP


A broad community project with heavy IRIS-HEP involvement.

21 of 42

A coherent ecosystem


One of our analysis use cases involves a vertical slice from ServiceX to final limits for a real-world ATLAS Higgs analysis. See Alex Held’s poster.

22 of 42

A coherent ecosystem


[Diagram: ecosystem components including ServiceX, yadage, func_adl, formulate, and coffea]

23 of 42

The Future

IRIS-HEP Focus Areas


Slides from Johann Brehmer’s Keynote talk at ACAT on Constraining Effective Field Theories with Machine Learning

Tight integration of

  • Simulation
  • Machine Learning
  • Statistical Inference

24 of 42

Major Activities

  • Development of declarative specifications for different stages of analysis
  • Identification and benchmarking of traditional implementations for benchmark example use-cases that span the scope of AS
  • Implementation of prototype components & integration
    • connection with DOMA (particularly ServiceX)
  • Benchmarking and assessment of prototype implementations and declarative specifications for the same example use cases
    • connection with SSL (dedicated Blueprint Activity)
  • Exploratory research in machine learning that may impact how analysis is performed
  • Engagement with community of early adopters and developers


25 of 42

Connections to DOMA & SSL

ServiceX is part of DOMA’s iDDS

  • feeds data to downstream analysis systems
  • uses components from analysis systems to:
    • read ROOT-formatted data
    • transform analysis languages
    • export data formatted for downstream analysis


[Diagram: ServiceX between the Data Lake (cached distribution) and Analysis Systems]

ServiceX is being prototyped using IRIS-HEP’s Scalable Systems Lab

  • 10 TB of xAOD data from ATLAS using IRIS-HEP’s functional analysis description language (func_adl)
  • CMS example using uproot & awkward

B. Galewsky, R. Gardner, L. Gray, M. Neubauer, J. Pivarski, M. Proffitt, I. Vukotic, G. Watts, M. Weinberg

26 of 42

Milestones and Deliverables


27 of 42

Milestones and Deliverables


Progress tracked on GitHub: github.com/iris-hep/project-milestones/

28 of 42

Metrics

M.2.1: Number of specifications developed

  • 12 thus far. Expect maybe 50 after 5 years

M.2.2: Number of implementations for corresponding specifications

  • 5/12: ppx, func_adl, pyhf, aghast, histos, decay language

M.2.3: Throughput and latency metrics for analysis systems using SSL testbed

  • Aiming for ~10x speedup for various analysis tasks. Seeing >100x in some cases

M.2.4: List of experiments using CAP and number of analyses stored in CAP

  • 14 ATLAS analyses with workflows in CAP/REANA/RECAST-ready format

M.2.5: Number of results / papers making use of CAP/REANA

  • 3 thus far, more on the way

M.2.6: GitHub stars, forks, watch, contributor statistics

  • 12 GitHub repos
  • healthy statistics for core projects


e.g. uproot & awkward

29 of 42

Community Building

  • Active participation in relevant venues:
    • HEP Software Foundation, PyHEP, CHEP, ACAT, HOW, …
    • Internal experiment meetings, IRIS-HEP topical meetings
    • Partner projects: DIANA, SCAILFIN, CERN-IT, ...
    • >100 presentations and 20 publications thus far
  • High-profile projects that provide clarity of vision and leadership
    • Scikit-hep, uproot, awkward-array, histos, (Coffea)
    • MadMiner, AmpGen, functional analysis description language
    • pyhf, RECAST
  • Growing community of early adopters using tools now
    • >1000 downloads / week for uproot


“I just wanted to express my personal awe to you and your team working so hard on a bunch of wonderful projects. The talks delivered by Johann, Lukas and Gunes were excellent! In my personal opinion it was the best part of ACAT conference.” - Andrey Ustyuzhanin (LHCb & Yandex School of Data Analysis)

30 of 42

Training


31 of 42

Presentations & Publications


108 presentations and 22 publications thus far

32 of 42

Value of IRIS-HEP as an Institute

IRIS-HEP as a tugboat:

  • direct and navigate large efforts in the collaborations with significant inertia
  • take advantage of consistent presence and messaging within the large collaborations
  • Examples:
    • pythonic analysis tools
    • software practices
    • industry-standards


33 of 42

Value of IRIS-HEP as an Institute

IRIS-HEP as a lighthouse:

  • provide cohesive, long-term vision for how software should evolve to meet needs of HL-LHC
  • take advantage of holistic perspective of the institute
  • Examples:
    • columnar analysis
    • declarative programming
    • preservation & reuse


34 of 42

Highlight

  • The field is at a tipping point; DIANA/DASPOS/IRIS-HEP contributions have been transformational.

  • First results using the RECAST reinterpretation framework and publishing full statistical likelihoods (using pyhf)


ROOT: 10+ hours

pyhf: < 30 minutes
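To make this concrete, a minimal sketch of evaluating such a published likelihood with pyhf; "BkgOnly.json" stands in for a workspace file downloaded from HEPData, and the calls follow current pyhf releases.

```python
# A minimal sketch of using a published likelihood with pyhf;
# "BkgOnly.json" stands in for a workspace downloaded from HEPData.
import json
import pyhf

with open("BkgOnly.json") as f:
    workspace = pyhf.Workspace(json.load(f))

model = workspace.model()     # HistFactory model built from the JSON spec
data = workspace.data(model)  # observed data + auxiliary measurements

# Observed CLs at nominal signal strength mu = 1.
cls_obs = pyhf.infer.hypotest(1.0, data, model, test_stat="qtilde")
print(f"CLs(mu=1) = {cls_obs}")
```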

35 of 42

Highlight


Featured on CERN homepage

36 of 42

Highlight

IRIS-HEP Focus Areas


Finalist for best paper award at SC19 (Supercomputing)

  • Largest-scale Bayesian inference ever performed in a universal probabilistic programming language
    • Applied to a complex LHC physics use case: the Sherpa code base, ~1M lines of C++
  • 230x speedup for synchronous data-parallel training of a 3DCNN-LSTM neural network
    • 1,024 nodes (32,768 CPU cores)
    • 128k minibatch size, the largest for this NN architecture
    • One of the largest-scale uses of PyTorch’s built-in MPI
  • Novel protocol (PPX) to execute & control existing, large-scale scientific simulator code bases

37 of 42

Beyond HEP

  • Co-organized by IRIS-HEP members
  • 188 Paper submissions
  • University + Industry + Labs
  • Diversity in topics and participants


38 of 42

Beyond HEP

Machine learning & statistical techniques originally developed for the LHC are now being used to probe dark matter with strong gravitational lensing


arXiv:1909.02005 published in The Astrophysical Journal.

39 of 42

Beyond HEP

Collaboration with DeepMind on AI techniques inspired by physics

Relevant for:

  • HEP
  • nuclear physics (lattice QCD)
  • cosmology
  • geology
  • protein structure
  • robotics


Protein figure from Boomsma [https://doi.org/10.1073/pnas.0801715105]


40 of 42

Beyond HEP


See Sebastian Macaluso’s poster highlighting exploratory machine learning projects.

  • examples of use-inspired research
  • connections to natural language processing (NLP) and genomics

https://arxiv.org/abs/2002.11661

41 of 42

Beyond HEP

Collaboration with DeepMind on AI techniques inspired by physics

Models that incorporate physics generalize to unseen systems (zero-shot learning)


42 of 42

Beyond HEP

Collaborating with CS & astrophysics on computing models and tools to use HTC and HPC together, published as:

E. A. Huerta, R. Haas, S. Jha, M. Neubauer, D. S. Katz, "Supporting High‐Performance and High‐Throughput Computing for Experimental Science," Computing and Software for Big Science 3:5, 2019. doi: 10.1007/s41781-019-0022-7


Components involved in starting a Shifter job on Blue Waters (HPC). Jobs are submitted to workload manager on Blue Waters’ login nodes, which launches jobs on compute nodes. When job requests use containers, workload manager first uses Shifter runtime environment to pull an up-to-date copy of the container image from Docker Hub. This image is repackaged as a user-defined image, then pre-mounted (prologue) by the jobs on the compute nodes and unloaded post-job (epilogue).

Left: period of time during which 35 million ATLAS events were processed using 300 Blue Waters nodes. Utilization during this period averaged 81%, typical for Blue Waters. Right: backlog of queued jobs for the same period in requested nodes, with colors indicating user accounts. During this period, the queued workload never dropped below 80,000 nodes, i.e., four times the number of nodes in Blue Waters. The red and blue curves below the horizontal axis are nodes available for work scavenging during this period.