1 of 10

Seamless transition

from TTree to RNTuple analysis with RDataFrame

ACAT 2024, Stony Brook University

  1. CERN
  2. Princeton University
  3. Taras Shevchenko National University of Kyiv
  4. Fermi National Accelerator Laboratory

2 of 10

RDataFrame


ROOT analysis interface since 6.14 (2018):

  • Intuitive
  • Declarative and fast
  • Flexible

3 of 10

RDataFrame


Today’s focus: RNTuple + distributed RDF → seamless experience for the user


4 of 10

Analysis Grand Challenge

  • AGC – a set of HEP analysis benchmarks
    • Available in various implementations, including with RDataFrame
    • In particular: a tt̅ analysis based on CMS Open Data


5 of 10

Current status of AGC with RDF

  • Talk at CHEP last year
    • AGC v.0.1.0
  • Since then:
    • RDF implementation: new data format – NanoAOD (AGC v.1)
    • RDF implementation: machine-learning inference for jet-parton assignment (AGC v.2)
  • In this talk:
    • Replicating the AGC benchmark with RNTuple, including distributed execution via HTCondor


6 of 10

Distributed analysis environment

  • There are a number of ways to run distributed RDF
  • Focus here: rediscovering existing infrastructure and services in a modern way
    • SWAN
    • HTCondor pools
    • Scheduling via Dask


CVMFS + EOS + CERN batch + ROOT → CERN Analysis Facility (?)

7 of 10

Distributed AGC with TTree and RNTuple – user side

The only change for the user is the ROOT input file!


8 of 10

Validation of histograms

  • Distributed analysis with RNTuple just works!
  • Satisfactory agreement with equivalent histograms from other execution policies
    • 100% bin-by-bin agreement for 120 histograms
    • 2 histograms with <1% disagreement due to bin migrations



9 of 10

AGC v.1 performance – TTree and RNTuple

  • Scaling tests on SWAN with AGC v.1, up to 64 workers


[Plot: speedup vs. number of workers, for TTree and RNTuple]

10 of 10

Summary and next steps

  • Running the AGC on SWAN with HTCondor pools via Dask is smooth with both TTree and RNTuple
    • With zero code changes for the user
    • Achieving (almost) perfect agreement across the available histograms
    • Sanity check: distributed execution with RNTuple up to 64 workers
  • Making RDataFrame ready for HL-LHC analyses
  • Next steps:
    • Keep track of the latest AGC benchmark specification
    • Include more benchmarks, with existing or new TTree open data converted to RNTuple
