1 of 42

Response to Questions/Requests for Analysis Systems

Kyle Cranmer (NYU)

2 of 42

Analysis Systems Team


Institutions: NYU, Washington, Princeton, Cincinnati, Illinois

3 of 42

Summarize the set of projects/activities and associated effort for your area

See next two pages


4 of 42

Projects

  • Analysis systems are connected to analysis use cases
  • Systems are composed of components
  • Most of these projects refer to those components
    • many projects include people beyond IRIS-HEP
  • Milestones and activities are mainly oriented towards integration and evaluation, with a global overview of the vertical slice


5 of 42

Summarize the set of projects/activities and associated effort for your area

Awkward, uproot:

  • Jim Pivarski and Ianna Osborne (funded largely by DIANA-HEP until ~now)

func_adl & ServiceX:

  • Gordon Watts + Mason Proffitt @ UW, in collaboration with Ben Galewsky & Marc Weinberg

pyhf: Matthew Feickert (~80%)

MadMiner & Exploratory ML: Johann Brehmer (~50%)

Integrating tools & developing declarative specifications for end-users:

  • Alexander Held (~80%)

Modernizing analysis support (e.g. Docker, analysis preservation, training, conda-forge, etc.)

  • Matthew Feickert (~20%), Alexander Held (~20%), Henry Schreiner (~25%?)

Cyberinfrastructure: Sinclert Perez (~30% starting 2020, HEPData and RECAST)

Core Scikit-HEP: boost-histogram, hist, vector, … Henry Schreiner (~65%)

Misc. Scikit-HEP: scikit-hep org, particle, DecayLanguage … Henry Schreiner (10%), Daniel Vieira (???, not participating in biweekly meetings)


It should be easier for area leads to see the FTEs

ServiceX activities overlap with DOMA/SSL; not sure how to handle the accounting for the purposes of this meeting.

6 of 42

Are there internal or external collaborations associated with each project or activity? For external collaborations, is IRIS-HEP leading, contributing or simply “connecting/liaising”?

Internal:

  • SSL: benchmarking and scaling, REANA testbeds, etc.
  • SSL & DOMA: ServiceX

External:

  • DIANA-HEP: last bits of funding on the no-cost extension (NCE) supporting various well-aligned items
  • SCAILFIN: developing products, good synergy w/ IRIS-HEP and the REANA dev team
  • INSPIRE-HEP, HEPData, CAP, Invenio: advisory boards, joint development
  • ATLAS stats effort: Docker containers for RooFit-based statistical analysis & combinations and development of pyhf tools. IRIS-HEP (Matthew, Kyle, Alex) & Lukas & Giordon are leading
  • HEP Statistics Serialization Standard (HS3): similar cast of characters
  • scikit-hep: useful umbrella (not seen as US, ATLAS/CMS, or HSF); IRIS-HEP leading by example
    • Awkward:
      • formal collaboration with Amy Roberts at UC Denver on Kaitai Structs
      • frequent collaboration with LPC/Coffea (Lindsey Gray)
      • close liaisons with Anaconda.com: Numba and Dask developers
      • intermittent contact with Oxford Big Data Institute (genetics, developers of Zarr)


7 of 42

Which project/activities/goals are making progress and which are not? (Area lead’s opinion) For those that are not, what is impeding progress?

Projects making good progress with good adoption:

  • uproot and awkward usage in physics and by developers
  • ServiceX & func_adl code development is active and the groups seem to coordinate well
  • pyhf adoption in physics analysis (example: ATLAS SUSY EWK group using it for official combinations)
  • REANA growing in popularity and planning to deploy at facilities
  • MadMiner interest is growing: ~100-person tutorial full of ATLAS & CMS people trying it out
  • Scikit-HEP community growing rapidly, interest from experiments’ statistics committees & physics groups

Some issues / concerns

  • ServiceX for CMS had some bumps, but that is being addressed. Large effort should have big win.
  • LHCb effort in GooFit/AmpGen/DecayLanguage not well integrated (though conceptually good fit)
  • Role of “Analysis Facility” not part of discussions in Analysis Systems group, not well integrated.

Basically completed

  • Awkward (1.0? clarify)
  • Uproot 3.x (future is Uproot 4.x)
  • PPX protocol
  • Boost histogram (future is hist)
  • MadMiner as a tool (user community growing)


analysis facility blueprint needed

A gap in planning.

8 of 42

How are each of these projects/activities connected to, being informed by or planning on delivering (eventually) to the experiments? Are there relevant blueprint meetings or workshops that should happen to make progress?

  • IRIS-HEP stats tools making good progress in ATLAS
  • RECAST, pyhf, analysis preservation making good progress in ATLAS
    • Clemens Lange looking for RECAST contributors for CMS, added a Fellows project to encourage this.
  • MadMiner is having tires kicked by ATLAS & CMS experimentalists
    • closely related: reweighting based on CARL is an ATLAS qualification project
  • Coffea being picked up increasingly in CMS
    • corresponding effort needed for ATLAS and LHCb. IRIS-HEP fellow project added to encourage this
    • Alex Held and KyungEon working on a TRExFitter analog using these tools
  • A blueprint connected to AS grand challenge & analysis facility may help
  • ServiceX in R&D phase now, but need to check on planning for ServiceX for analysis facility: Skyhook, River, …
    • answer requested on SLACK ServiceX channel (next page)


KyungEon doesn’t seem to formally be part of IRIS-HEP. He should be. Maybe a fellow?

9 of 42

How are each of these projects/activities connected to, being informed by or planning on delivering (eventually) to the experiments? Are there relevant blueprint meetings or workshops that should happen to make progress?

ServiceX is designed to facilitate high-performance array-based analyses. It does this by allowing users to construct sophisticated in-place data queries via an analysis description language, performing on-the-fly data transformations into convenient analysis formats, and connecting the output to future analysis facilities.

  • It is closely connected to both ATLAS and CMS, and features the ability to handle experiment-specific input formats like xAOD (ATLAS) and miniAOD (CMS).
  • The developers are currently in close contact with members of each experiment to ensure the service will improve time-to-insight in both cases, and to make it possible to easily develop additional transformers for new input formats.
  • While ServiceX is currently in a prototyping phase, it seeks to reach production version 1.0 by late May, at which point the service will benefit significantly from formal blueprint meetings with both collaborations.
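To give a flavor of the query interface described above, here is a minimal func_adl sketch. The query operators (SelectMany, Where, Select) are func_adl’s own; the package entry point, dataset name, and jet collection name are assumptions for illustration, and the exact imports have varied across prototype releases.

```python
# A minimal sketch of a func_adl query against ServiceX; the entry point and
# dataset name are assumptions, the query operators are func_adl's.
from func_adl_servicex import ServiceXSourceXAOD  # assumed package/class name

ds = ServiceXSourceXAOD("mc16_13TeV.SomeDataset.DAOD_PHYS")  # hypothetical Rucio DID

# Declarative query: jet pTs above 30 GeV across all events.
jet_pts = (
    ds.SelectMany(lambda e: e.Jets("AntiKt4EMTopoJets"))
      .Where(lambda j: j.pt() / 1000.0 > 30.0)
      .Select(lambda j: j.pt() / 1000.0)
      .AsAwkwardArray(["JetPt"])
      .value()  # triggers the ServiceX transformation and fetches the result
)
```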


10 of 42

What would be potential Year 3 milestones for each of the projects? (First ideas, to be iterated with PIs and the whole team as this process moves forward.)

  • Integration of func_adl specification for variable definition and selections with the emerging specification for high-level TRExFitter-like analysis
  • Demonstration of a differentiable analysis pipeline (e.g. a new TRExFitter) ending with a pyhf limit, back-propagating through selections implemented with awkward, func_adl, etc. (see the sketch after this list)
    • connect to pyhf/neos demo. Need autodiff-able awkward
    • connect to histogramming projects
    • Discussion in Slack to connect this with the Sally algorithm in MadMiner
  • Documentation and training event using new tools
  • Use of new IRIS-HEP tools (MadMiner, awkward, …) for analysis in LHC experiment (may not be published by end of Y3)
  • Snowmass (tools & REANA workflows for sensitivity studies)
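To illustrate the differentiable-pipeline milestone, here is a minimal sketch in the spirit of pyhf + neos, using plain JAX: a hard selection cut is relaxed to a sigmoid so that a simple figure of merit becomes differentiable with respect to the cut position. Everything here (the toy data, the s/sqrt(b) objective) is illustrative, not the planned implementation.

```python
# An illustrative sketch of a differentiable analysis step (JAX only); the real
# milestone targets pyhf limits back-propagated through awkward/func_adl.
import jax
import jax.numpy as jnp

def soft_cut(x, threshold, slope=10.0):
    # Differentiable relaxation of the hard selection x > threshold.
    return jax.nn.sigmoid(slope * (x - threshold))

def objective(threshold, sig, bkg):
    s = jnp.sum(soft_cut(sig, threshold))  # expected signal after the soft cut
    b = jnp.sum(soft_cut(bkg, threshold))  # expected background after the soft cut
    return -s / jnp.sqrt(b + 1e-3)         # negative s/sqrt(b): minimize this

key_s, key_b = jax.random.split(jax.random.PRNGKey(0))
sig = 2.0 + jax.random.normal(key_s, (1_000,))  # toy signal peaked at 2
bkg = jax.random.normal(key_b, (10_000,))       # toy background at 0

grad = jax.grad(objective)
threshold = 0.0
for _ in range(200):  # gradient descent on the cut position
    threshold = threshold - 0.1 * grad(threshold, sig, bkg)
print(f"optimized cut: {threshold:.2f}")
```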


autodiff blueprint

(may need to send gradients back to analysis facility via ServiceX)

11 of 42

What “grand challenges” would be useful to organize involving your area during Year 3 of IRIS-HEP? How would these challenges depend on efforts from other areas of IRIS-HEP, the US LHC Ops programs or the experiments?

(Assuming not to be completed in Y3, but organized in Y3)

  • Ability to process XX TB of data in YY minutes using columnar analysis tools
  • End-to-end analysis optimization with automatic differentiation on large (~TB) simulated data sets with multiple signal and background components and systematic uncertainties. [touches on ServiceX]
  • Ability to test a new theory by reinterpreting multiple analyses and performing a statistical combination of their results in ~1 day (+ the time it takes to generate new signal MC) [touches on LHC Ops b/c would use production system]
    • extend by using excursion to streamline MC production for ATLAS or CMS reinterpretation campaign [touches on LHC Ops b/c would use production system]
  • Ability for new user to “fork” an analysis, make a modification, and obtain new results in an afternoon. [touches the experiments and training]


12 of 42

Are there new opportunities where effort from IRIS-HEP can make an impact? Is the alignment of the focus areas in IRIS-HEP appropriate?

  • Visualization tools (e.g. Altair-like declarative visualization; see the sketch after this list)
    • yes for AS, but expansion of scope
  • excursion alg. to streamline MC production for ATLAS or CMS reinterpretation campaign
    • yes for AS
  • Improve efficiency of event gen. with ML-inspired tools & techniques
    • yes for AS, but an expansion into “theory” tools
  • pyhf and astrophysics (HEALPix for boost histogram)
    • yes for AS, but secondary aim of IRIS-HEP
  • MadMiner like tools for EIC
    • yes for AS, but secondary aim of IRIS-HEP. Brought up at 18 mo review
  • Python library for FastJet that plays well with columnar analysis
  • Documentation efforts
    • yes, aligns with “lowering barriers” goal of AS
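A minimal Altair sketch of what “declarative visualization” means in the first bullet above (the histogram data is made up): you declare encodings, not drawing commands.

```python
# A minimal sketch of declarative visualization with Altair (toy data).
import altair as alt
import pandas as pd

df = pd.DataFrame({"mass": [110, 115, 120, 125, 130],
                   "events": [52, 48, 61, 95, 57]})

# Declare what to plot (encodings), not how to draw it.
chart = alt.Chart(df).mark_bar().encode(
    x=alt.X("mass:Q", title="m [GeV]"),
    y=alt.Y("events:Q", title="Events"),
)
chart.save("mass_hist.html")  # rendered via Vega-Lite
```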


13 of 42

How are projects currently managed in your area? What tools are being used? How is progress measured? How are risks recorded, identified and mitigated?


Progress tracked on GitHub: github.com/iris-hep/project-milestones/

14 of 42

Are the metrics being used to measure success clearly defined? How well do metrics in your area measure progress, success or impact? Where can the metrics be improved or refined to better measure progress, success or impact?

Metrics listed on next page for reference.

  • Metrics are clearly defined
    • I put some effort into defining reasonable targets for them
    • some targets are “relative” (fraction of specs that are implemented) and some are “absolute” (number of XXX)
    • Not clear if targets should be for the current time or end of Y5
  • They are OK at measuring progress / effort, though not necessarily great at measuring “success”
    • hard to assess meaningful measures of “success” early in the (R&D) phase
    • measures of user adoption & results using new tools will lag development by >1 yr
  • Maybe better to enumerate the AS components needed for a few vertical analysis use cases and track what fraction of those components are implemented (and connect this with performance benchmarking of individual components)


15 of 42

Metrics

M.2.1: Number of specifications developed

  • 12 thus far. Expect maybe 50 after 5 years

M.2.2: Number of implementations for corresponding specifications

  • 5/12: ppx, func_adl, pyhf, aghast, histos, decay language

M.2.3: Throughput and latency metrics for analysis systems using SSL testbed

  • Aiming for ~10x speedup for various analysis tasks. Seeing >100x in some cases

M.2.4: List of experiments using CAP and number of analyses stored in CAP

  • 14 ATLAS analyses with workflows in CAP/REANA/RECAST-ready format

M.2.5: Number of results / papers making use of CAP/REANA

  • 3 thus far, more on the way

M.2.6: GitHub stars, forks, watch, contributor statistics

  • 12 GitHub repos
  • healthy statistics for core projects


e.g. uproot & awkward
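These M.2.6 statistics can be pulled straight from the public GitHub REST API; a small sketch follows (the repository list is illustrative, not the official set of 12).

```python
# A sketch of collecting M.2.6-style statistics from the public GitHub REST API.
import requests

repos = ["scikit-hep/uproot", "scikit-hep/awkward-array", "scikit-hep/pyhf"]
for repo in repos:
    info = requests.get(f"https://api.github.com/repos/{repo}").json()
    # stars, forks, watchers ("subscribers" in the API)
    print(repo, info["stargazers_count"], info["forks_count"],
          info["subscribers_count"])
```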

16 of 42

Backup


17 of 42

Prior to IRIS-HEP

[Diagram: bulk data processing → reconstruction algorithms → analysis code]

Analysis code in HEP is often more free-form with less organized development:

  • one-off approach limits functionality
  • slow iteration cycle
  • slow on-boarding and lack of interoperability
  • difficult to reproduce and reuse
  • primarily ROOT & C++
  • lack of developer community
  • overlapping solutions
  • data redundancy


18 of 42

IRIS-HEP as an Institute

[Diagram: ad hoc analysis code evolving into Analysis Systems]

Analysis Systems strategies:

  • improve functionality & interoperability
  • more modular, less dependence on ROOT
  • declarative: focus on what to do, not how to do it
  • align with modern data science practices
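A minimal uproot + awkward sketch of the columnar, declarative style these strategies point to (the file and branch names are hypothetical; Uproot 4-style API):

```python
# A minimal sketch of columnar analysis with uproot + awkward
# ("events.root" and the branch name are hypothetical).
import awkward as ak
import uproot

tree = uproot.open("events.root")["Events"]
arrays = tree.arrays(["jet_pt"], library="ak")  # jagged: one list of jets per event

# One array expression replaces an explicit event loop.
selected = arrays["jet_pt"][arrays["jet_pt"] > 30.0]  # per-event selection
leading = ak.max(selected, axis=1)                    # leading selected jet pT per event
```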



19 of 42

Analysis Systems

  • Develop sustainable analysis tools to extend the physics reach of the HL-LHC experiments by
    • creating greater functionality to enable new techniques,
    • reducing time-to-insight,
    • lowering the barriers for smaller teams, and
    • streamlining analysis preservation, reproducibility, and reuse.


Analysis Systems projects span all stages of end-user analysis.

20 of 42

Scikit-HEP


A broad community project with heavy IRIS-HEP involvement.

21 of 42

A coherent ecosystem


One of our analysis use cases involves a vertical slice from ServiceX to final limits for a real-world ATLAS Higgs analysis. See Alex Held’s poster.

22 of 42

A coherent ecosystem


[Diagram: ecosystem components including ServiceX, yadage, func_adl, formulate, and coffea]

23 of 42

The Future

IRIS-HEP Focus Areas


Slides from Johann Brehmer’s Keynote talk at ACAT on Constraining Effective Field Theories with Machine Learning

Tight integration of

  • Simulation
  • Machine Learning
  • Statistical Inference

24 of 42

Major Activities

  • Development of declarative specifications for different stages of analysis
  • Identification and benchmarking of traditional implementations for benchmark example use-cases that span the scope of AS
  • Implementation of prototype components & integration
    • connection with DOMA (particularly ServiceX)
  • Benchmarking and assessment of prototype implementations and declarative specifications for the same example use cases
    • connection with SSL (dedicated Blueprint Activity)
  • Exploratory research in machine learning that may impact how analysis is performed
  • Engagement with community of early adopters and developers


25 of 42

Connections to DOMA & SSL

ServiceX is part of DOMA’s iDDS

  • feeds data to downstream analysis systems
  • uses components from analysis systems to:
    • read ROOT-formatted data
    • transform analysis languages
    • export data formatted for downstream analysis


[Diagram: ServiceX between the Data Lake (cached distribution) and Analysis Systems]

ServiceX is being prototyped using IRIS-HEP’s Scalable Systems Lab

  • 10 TB of xAOD data from ATLAS using IRIS-HEP’s functional analysis description language (func_adl)
  • CMS example using uproot & awkward

B. Galewsky, R. Gardner, L. Gray, M. Neubauer, J. Pivarski, M. Proffitt, I. Vukotic, G. Watts, M. Weinberg

26 of 42

Milestones and Deliverables


27 of 42

Milestones and Deliverables


Progress tracked on GitHub: github.com/iris-hep/project-milestones/

28 of 42

Metrics

M.2.1: Number of specifications developed

  • 12 thus far. Expect maybe 50 after 5 years

M.2.2: Number of implementations for corresponding specifications

  • 5/12: ppx, func_adl, pyhf, aghast, histos, decay language

M.2.3: Throughput and latency metrics for analysis systems using SSL testbed

  • Aiming for ~10x speedup for various analysis tasks. Seeing >100x in some cases

M.2.4: List of experiments using CAP and number of analyses stored in CAP

  • 14 ATLAS analyses with workflows in CAP/REANA/RECAST-ready format

M.2.5: Number of results / papers making use of CAP/REANA

  • 3 thus far, more on the way

M.2.6: GitHub stars, forks, watch, contributor statistics

  • 12 GitHub repos
  • healthy statistics for core projects


e.g. uproot & awkward

29 of 42

Community Building

  • Active participation in relevant venues:
    • HEP Software Foundation, PyHEP, CHEP, ACAT, HOW, …
    • Internal experiment meetings, IRIS-HEP topical meetings
    • Partner projects: DIANA, SCAILFIN, CERN-IT, ...
    • >100 presentations and 20 publications thus far
  • High-profile projects that provide clarity of vision and leadership
    • Scikit-hep, uproot, awkward-array, histos, (Coffea)
    • MadMiner, AmpGen, functional analysis description language
    • pyhf, RECAST
  • Growing community of early adopters using tools now
    • >1000 downloads / week for uproot


“I just wanted to express my personal awe to you and your team working so hard on a bunch of wonderful projects. The talks delivered by Johann, Lukas and Gunes were excellent! In my personal opinion it was the best part of ACAT conference.” - Andrey Ustyuzhanin (LHCb & Yandex School of Data Analysis)

30 of 42

Training


31 of 42

Presentations & Publications


108 presentations and 22 publications thus far

32 of 42

Value of IRIS-HEP as an Institute

IRIS-HEP as a tugboat:

  • direct and navigate large efforts in the collaborations with significant inertia
  • take advantage of consistent presence and messaging within the large collaborations
  • Examples:
    • pythonic analysis tools
    • software practices
    • industry-standards


33 of 42

Value of IRIS-HEP as an Institute

IRIS-HEP as a lighthouse:

  • provide cohesive, long-term vision for how software should evolve to meet needs of HL-LHC
  • take advantage of holistic perspective of the institute
  • Examples:
    • columnar analysis
    • declarative programming
    • preservation & reuse


34 of 42

Highlight

  • The field is at a tipping point; DIANA/DASPOS/IRIS-HEP contributions have been transformational.

  • First results using the RECAST reinterpretation framework and publishing full statistical likelihoods (using pyhf)


ROOT: 10+ hours

pyhf: < 30 minutes
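To make this concrete, a minimal sketch of evaluating such a published likelihood with pyhf; "BkgOnly.json" stands in for a workspace file downloaded from HEPData, and the calls follow current pyhf releases.

```python
# A minimal sketch of using a published likelihood with pyhf;
# "BkgOnly.json" stands in for a workspace downloaded from HEPData.
import json
import pyhf

with open("BkgOnly.json") as f:
    workspace = pyhf.Workspace(json.load(f))

model = workspace.model()     # HistFactory model built from the JSON spec
data = workspace.data(model)  # observed data + auxiliary measurements

# Observed CLs at nominal signal strength mu = 1.
cls_obs = pyhf.infer.hypotest(1.0, data, model, test_stat="qtilde")
print(f"CLs(mu=1) = {cls_obs}")
```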

35 of 42

Highlight


Featured on CERN homepage

36 of 42

Highlight

IRIS-HEP Focus Areas


Finalist for best paper award at SC19 (Supercomputing)

  • Largest-scale Bayesian inference ever performed in a universal probabilistic programming language
    • Applied to a complex LHC physics use case: the Sherpa code base, ~1M lines of C++
  • 230x speedup for synchronous data-parallel training of a 3DCNN-LSTM neural network
    • 1,024 nodes (32,768 CPU cores)
    • 128k minibatch size, the largest for this NN architecture
    • One of the largest-scale uses of PyTorch’s built-in MPI
  • Novel protocol (PPX) to execute & control existing, large-scale scientific simulator code bases

37 of 42

Beyond HEP

  • Co-organized by IRIS-HEP members
  • 188 Paper submissions
  • University + Industry + Labs
  • Diversity in topics and participants


38 of 42

Beyond HEP

Machine learning & statistical techniques originally developed for the LHC are now being used to probe dark matter with strong gravitational lensing


arXiv:1909.02005 published in The Astrophysical Journal.

39 of 42

Beyond HEP

Collaboration with DeepMind on AI techniques inspired by physics

Relevant for:

  • HEP
  • nuclear physics (lattice QCD)
  • cosmology
  • geology
  • protein structure
  • robotics


Protein figure from Boomsma [https://doi.org/10.1073/pnas.0801715105]


40 of 42

Beyond HEP


See Sebastian Macaluso’s poster highlighting exploratory machine learning projects.

  • examples of use-inspired research
  • connections to natural language processing (NLP) and genomics

https://arxiv.org/abs/2002.11661

41 of 42

Beyond HEP

Collaboration with DeepMind on AI techniques inspired by physics

Models that incorporate physics generalize to unseen systems (zero-shot learning)


42 of 42

Beyond HEP

Collaborating with CS & astrophysics on computing models and tools to use HTC and HPC together, published as:

E. A. Huerta, R. Haas, S. Jha, M. Neubauer, D. S. Katz, "Supporting High‐Performance and High‐Throughput Computing for Experimental Science," Computing and Software for Big Science 3:5, 2019. doi: 10.1007/s41781-019-0022-7


Components involved in starting a Shifter job on Blue Waters (HPC). Jobs are submitted to workload manager on Blue Waters’ login nodes, which launches jobs on compute nodes. When job requests use containers, workload manager first uses Shifter runtime environment to pull an up-to-date copy of the container image from Docker Hub. This image is repackaged as a user-defined image, then pre-mounted (prologue) by the jobs on the compute nodes and unloaded post-job (epilogue).

Left: period of time during which 35 million ATLAS events were processed using 300 Blue Waters nodes. Utilization during this period averaged 81%, typical for Blue Waters. Right: backlog of queued jobs for the same period in requested nodes, with colors indicating user accounts. During this period, the queued workload never dropped below 80,000 nodes, i.e., four times the number of nodes in Blue Waters. The red and blue curves below the horizontal axis are nodes available for work scavenging during this period.