1 of 46

Summary of S2I2/DOE mini-workshop on Software/Computing R&D for HL-LHC

November 28-29, 2017


2 of 46

Workshop Participants

Attended in person: Mike Sokoloff, Peter Elmer, Mark Neubauer, Oli Gutsche, Liz Sexton-Kennedy, Mike Hildreth, Lothar Bauerdick, Panagiotis Spentzouris, Eric Lancon, Rob Gardner, Frank Wuerthwein, Taylor Childers, Paolo Calafiura, Kaushik De

Connected remotely: Brian Bockelman, Andrea Dotti, Ken Bloom


3 of 46

Challenges for the next decade

  • HL-LHC, expected in 2026, brings a huge challenge to software and computing
    • Both the data rate and the event complexity rise


[Figures: ATLAS estimated disk needs and CMS estimated CPU needs into the HL-LHC era]

  • Not just a simple extrapolation of Run 2
  • The resources needed will exceed what technology evolution alone can provide


4 of 46

LHC/HL-LHC Timeline


5 of 46

Goals for this S2I2/DOE mini-workshop

Major theme: Coherence and Alignment of R&D activities for HL-LHC

Specific goals:

  • Review the vision for the ensemble of possible R&D efforts for the HL-LHC as articulated via the international CWP effort.
  • Articulate a vision (or “rules of engagement”) for how R&D efforts such as an NSF S2I2 would interact with the US-LHC Operations programs and DOE efforts (in the context of the full, international efforts).
  • Discuss the broad scope of relevant capabilities and current DOE and NSF funded efforts.


6 of 46

Process to arrive at “Community White Paper” (CWP)

  • Set of international workshops that involve US Universities and National Labs.
    • Driven by the HEP Software Foundation
    • Charged by the WLCG
    • Involving the US LHC operations program, the DOE labs, and the broader LHC university community in the US.
  • Leading to a set of broadly agreed upon white papers on the R&D challenges in software & computing for the HL-LHC


7 of 46

Outcomes of this mini-workshop

  • Commitment to joint blueprint activities
  • Commitment to joint project coordination across DOE-HEP, DOE-ASCR, NSF-EPP, and NSF-OAC funded activities.
    • Initial exploration and agreement on complementarity of DOE and NSF supported activities.
  • Initial agreement on S2I2 governance


8 of 46

Commitment to Joint DOE and NSF Blueprint Activity

  • Drive the evolution of R&D efforts to address the software & computing challenges of the HL-LHC, co-sponsored by:
    • US LHC Ops program
    • S2I2
    • OSG
    • CCE

  • Involving the DOE facilities and key personnel at both DOE labs and US universities.
  • A long-term, sustained set of workshops to drive coherence across projects and experiments.


Blueprint Process

[Diagram: the blueprint process links HEP researchers (university, lab, international), the LHC experiments and US LHC Ops programs, the S2I2 Software Institute, and resource providers (DOE labs, OSG, HPC facilities)]

9 of 46

R&D Partnerships

NSF and DOE partnership, as informed by the blueprint process, will be essential to the success of the HL-LHC R&D efforts


[Diagram: R&D partnerships among DOE labs and universities, the S2I2 Software Institute, the DOE-HEP Center for Computational Excellence (CCE), and DOE SciDAC]

10 of 46

Impact Criteria in CWP for Evaluating HL-LHC R&D Areas

Impact - Resources/Cost: Will efforts in this area lead to improvements in software efficiency, scalability, and performance, and make use of advances in CPU, storage, and network technologies, allowing the experiments to maximize their physics reach within their computing budgets?

Impact - Physics: Will efforts in this area enable new approaches to computing and software that maximize, and potentially radically extend, the physics reach of the detectors?

Impact - Sustainability: Will efforts in this area significantly improve the long term sustainability of the software through the lifetime of the HL-LHC?


11 of 46

S2I2-Specific Criteria for Focus Area Choices

Interest/Expertise: Does the U.S. university community have strong interest and expertise in the area?

Leadership: Are the proposed focus areas complementary to efforts funded by the US-LHC Operations programs, the DOE, or international partners?

Value: Is there potential to provide value to more than one LHC experiment and to the wider HEP community?

Research/Innovation: Are there opportunities for combining research and innovation as part of partnerships between the HEP and Computer Science/Software Engineering/Data Science communities?


12 of 46

S2I2 Focus Areas (highest-priority areas for initial S2I2 investment)

Evolved by the blueprint activity

13 of 46

S2I2 Management and Governance


14 of 46

DOE Role in software & computing R&D for the HL-LHC

  • Ramp up DOE support of software & computing R&D effort in the US LHC Ops program.
  • Continue support of software & computing R&D efforts at the DOE national labs and universities.
  • ASCR LCF investments provide a large opportunity to meet significant parts of the HL-LHC computing challenge. Capitalizing on this opportunity requires both effort within HEP and a commitment from ASCR to inform future LCF designs and operational models so that they meet HEP needs, e.g. via an "ECP for LHC".


15 of 46

A Coordinated Ecosystem for HL-LHC R&D

LHC and HL-LHC success relies on an elaborate software ecosystem, and that ecosystem will require significant evolution for the HL-LHC.

Multiple R&D efforts must be coordinated to achieve coherence and alignment among the many stakeholders and effort providers, both US and international. Strong DOE/NSF partnerships will be required, and a joint blueprint activity will be critical to building this coordination.

A Strategic Plan for an NSF-funded S2I2 Software Institute has been developed, with possible focus areas identified for university efforts. These were chosen to address common merit criteria and to complement existing DOE efforts in software & computing.

A continuing blueprint process will evolve this ecosystem to enable HL-LHC physics. The DOE will be a critical partner with the S2I2 in this blueprint activity.


16 of 46

Vision for S2I2 Institute Role for HL-LHC S&C R&D


17 of 46

Backup and/or Extra Slides


18 of 46

Partnerships and Coherence


19 of 46

Partnerships: Maintaining a Common Vision

Blueprint activity:

  • Maintains the software vision of the Institute consistent with the broader HL-LHC community context
  • A key element for informing R&D priorities and for moving efforts forward into deployed technologies for the experiments
  • Inclusive: co-sponsored by US LHC Ops program, DOE labs, S2I2, CCE, OSG and related university research in advanced cyberinfrastructure and data science


20 of 46

Partnerships: Sustainability

Backbone for sustainable software:

  • Employing best practices in sustainable software engineering
  • Deploying developed software for use in the experiments, partnering with other entities (e.g., Ops programs, labs) to support and evolve the software
  • Bringing in new effort at universities, with strong attention to professional development (training, mentorship, CS and industry contacts, etc.)


21 of 46

Essential Partnerships

DOE SciDAC projects


22 of 46

Alignment with National Priorities

(Alignment with Exascale, NSCI, NSF OAC, trends in the broader community)


23 of 46

Partnership Examples

  • Partner with the HEP-CCE and SciDAC projects to utilize ASCR resources
  • Jointly sponsored post-docs and students
  • HEP-CCE working with HEP community to improve software utilization and performance on DOE HPCs
  • BigPanDA project funded by DOE-ASCR and DOE-HEP in 2012 and renewed in 2016 by DOE-ASCR to focus on usage of Titan for HEP and other broad data science communities


24 of 46

Partnership Example - DOE-ASCR

  • BigPanDA project funded by DOE-ASCR and DOE-HEP in 2012 for BNL and UTA to explore the use of DOE LCFs for HEP simulations
  • Project renewed in 2016 by DOE-ASCR, with an expanded partnership among BNL, UTA, Oak Ridge, and Rutgers, to focus on the use of Titan for HEP and other data science communities


25 of 46

Community White Paper inception

  • An HEP Software Foundation (HSF) driven effort, launched at the May 2016 HSF meeting at LAL
    • Describe a global vision for software and computing R&D for the HL-LHC era and HEP in the 2020s
    • A step towards the Computing Technical Design Reports (CTDRs) for CMS and ATLAS at the HL-LHC
  • Formal charge from the WLCG in July 2016
    • Anticipate a "software upgrade" in preparation for HL-LHC
    • Identify and prioritize the software research and development investments
      • to achieve improvements in software efficiency, scalability and performance and to make use of the advances in CPU, storage and network technologies
      • to enable new approaches to computing and software that could radically extend the physics reach of the detectors
      • to ensure the long term sustainability of the software through the lifetime of the HL-LHC


26 of 46

Starting the process

  • Started to organise into different working groups at the end of 2016
  • Kick-off workshop 23-26 January 2017, San Diego
    • 110 participants, mainly US + CERN
    • 2.5 days of topical working group meetings
    • Extensive notebooks of initial discussions
  • Groups held workshops and meetings in the subsequent months
    • Broadening the range of participation
    • Some invited non-HEP experts to participate


27 of 46

Partnership Example: Analysis Systems


28 of 46

Example: Data Organization, Management and Access


29 of 46

Many HSF/CWP and S2I2 Topical Workshops


30 of 46

Concluding the process

  • Workshop in Annecy 26-30 June started to draw the process to a close
    • 90 Participants: 48 US, 42 Europe (of which 20 from CERN)
  • 13 working groups presented their status and plans
  • Substantial progress on many Community White Paper chapters
    • WGs used the workshop to make further progress on writing
  • Individual working groups produced dedicated topical white papers, in addition to the combined Community White Paper itself. The goal is to wrap up by Christmas.
  • Links to all documents: http://hepsoftwarefoundation.org/activities/cwp.html


31 of 46

Detector simulation

  • Simulating our detectors consumes huge resources today
    • Remains a vital area for HL-LHC and intensity frontier experiments in particular
  • Main R&D topics
    • Improved physics models for higher precision at higher energies (HL-LHC and then FCC)
      • Hadronic physics in LAr TPCs needs to be redeveloped
    • Adapting to new computing architectures
      • Can a vectorised transport engine be demonstrated to work in a realistic prototype? (a toy batched-propagation sketch follows this list)
    • Fast simulation - develop a common toolkit for tuning and validation
      • Can we use Machine Learning profitably here?
    • Geometry modeling
      • Easier modelling of complex detectors, targeting new computing architectures
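
To make the vectorised-transport question above concrete, the toy sketch below propagates a whole batch of charged particles through one small helical step in a uniform magnetic field with NumPy, so the same arithmetic runs over all particles at once instead of one particle at a time. This only illustrates the batching idea under simplified assumptions (uniform field, single step, no material effects); it is not GeantV or any production transport engine, and all values are placeholders.

```python
import numpy as np

def propagate_batch(pos, mom, charge, b_z=3.8, step=0.01):
    """Advance a batch of charged particles by one helical step (toy model).

    pos, mom : (N, 3) arrays, positions in metres and momenta in GeV/c
    charge   : (N,) array of charges in units of e
    b_z      : uniform magnetic field along z, in tesla
    step     : path length of this step, in metres
    """
    pt = np.hypot(mom[:, 0], mom[:, 1])              # transverse momentum
    radius = pt / (0.3 * np.abs(charge) * b_z)       # bending radius in metres
    dphi = np.sign(charge) * step / radius           # turning angle for the step
    cos_d, sin_d = np.cos(dphi), np.sin(dphi)
    px, py = mom[:, 0].copy(), mom[:, 1].copy()      # rotate momentum in the xy plane
    mom[:, 0] = cos_d * px - sin_d * py
    mom[:, 1] = sin_d * px + cos_d * py
    direction = mom / np.linalg.norm(mom, axis=1, keepdims=True)
    pos += step * direction                          # move along the new direction
    return pos, mom

# one vectorised call handles a large batch of toy particles
rng = np.random.default_rng(0)
n = 100_000
pos = np.zeros((n, 3))
mom = rng.normal(0.0, 5.0, size=(n, 3))
charge = rng.choice([-1.0, 1.0], size=n)
pos, mom = propagate_batch(pos, mom, charge)
```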


32 of 46

Software trigger and event reconstruction

  • The move to software triggers is already a key part of the program for LHCb and ALICE in Run 3
    • ‘Real time analysis’ increases signal rates and can make computing much more efficient (storage and CPU)
  • Main R&D topics
    • Controlling charged particle tracking resource consumption and maintaining performance
      • Does the physics output of current algorithms hold up at pile-up of 200 (or 1000)?
      • Can tracking maintain low pT sensitivity within budget?
    • Improved use of new computing architectures
      • Multi-threaded and vectorised CPU code (a toy thread-pool sketch follows this list)
      • Extending use of GPGPUs and possibly FPGAs
    • Robust validation techniques when information will be discarded
      • Using modern continuous integration, tackling multiple architectures with reasonable turnaround times
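
As a minimal sketch of the multi-threading point above (not any experiment's actual reconstruction framework, which in practice is C++ scheduled with TBB or a task-based framework), the toy below runs a stand-in "reconstruction" function over many events with a thread pool; the event content and the pT cut are placeholders.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def reconstruct_event(seed):
    """Toy stand-in for per-event reconstruction: count hits above a pT cut."""
    rng = np.random.default_rng(seed)
    hit_pt = rng.exponential(1.0, size=5_000)     # fake per-hit pT values [GeV]
    return int(np.count_nonzero(hit_pt > 0.5))    # placeholder for tracking work

# process a batch of events concurrently; real frameworks schedule C++
# algorithms across threads, but the batching pattern is the same
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(reconstruct_event, range(1_000)))

print(sum(results), "toy tracks in", len(results), "events")
```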


33 of 46

Data analysis and interpretation

  • Today we are dominated by many cycles of data reduction
    • Aim is to reduce the input to an analysis down to a manageable quantity that can be cycled over quickly on ~laptop scale resources
    • Key metric is ‘time to insight’
  • Main R&D topics
    • How to use the latest techniques in data analysis that come from outside HEP?
      • Particularly from the Machine Learning and Data Science domains
      • Need ways to interoperate seamlessly between their data formats and ROOT (see the sketch after this list)
        • Python is emerging as the lingua franca here, so maintaining a strong PyROOT is critical
    • New Analysis Facilities
      • Skimming/slimming cycles consume large resources and can be inefficient
      • Can interactive data analysis clusters be set up?
    • Data and analysis preservation is important
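
As a sketch of the ROOT/Python interoperability point above: libraries such as uproot read ROOT TTrees directly into NumPy/pandas structures, so standard data-science tooling applies without conversion steps. The file name, tree name, and branch names below are hypothetical placeholders.

```python
import uproot  # reads ROOT files into NumPy/pandas without a ROOT installation

# "events.root", the tree name, and the branch names are hypothetical
with uproot.open("events.root") as f:
    df = f["Events"].arrays(["muon_pt", "muon_eta", "met"], library="pd")

# from here, ordinary pandas/NumPy analysis applies directly
selected = df[(df["muon_pt"] > 25.0) & (df["met"] > 50.0)]
print(len(selected), "events pass the toy selection")
```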


34 of 46

Data processing frameworks

  • Experiment software frameworks provide the scaffolding for algorithmic code (a toy sketch follows this list)
    • Currently there are many framework implementations, with some sharing between experiments (e.g., ATLAS and LHCb share Gaudi, Intensity Frontier experiments use art)
    • All of these frameworks are evolving to support concurrency
  • Main R&D topics
    • Adaptation to new hardware, optimising efficiency and throughput
      • We need the best libraries for this and these will change over time
    • Incorporation of external (co)processing resources, such as GPGPUs
    • Interface with workload management system to deal with the inhomogeneity of processing resources
      • From volunteer computing to HPC job slots with 1000s of nodes
    • Which components can actually be shared and how is that evolution achieved?
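
As a purely illustrative sketch of what "scaffolding for algorithmic code" means (this is not the Gaudi, art, or CMSSW API), a framework owns the event loop and the algorithm lifecycle, while physicists supply algorithms with a fixed interface:

```python
class Algorithm:
    """Base class: the framework, not the user, drives initialize/execute/finalize."""
    def initialize(self): pass
    def execute(self, event): pass
    def finalize(self): pass

class TrackCounter(Algorithm):
    def initialize(self): self.total = 0
    def execute(self, event): self.total += len(event.get("tracks", []))
    def finalize(self): print("tracks seen:", self.total)

def event_loop(algorithms, events):
    """Toy sequential scheduler; real frameworks schedule algorithms concurrently."""
    for alg in algorithms:
        alg.initialize()
    for event in events:
        for alg in algorithms:
            alg.execute(event)
    for alg in algorithms:
        alg.finalize()

event_loop([TrackCounter()], [{"tracks": [1, 2, 3]}, {"tracks": [4]}])
```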


35 of 46

Data management and organisation

  • Data storage costs are a major driver for LHC physics today
    • HL-LHC will bring a step change in the quantity of data being acquired by ATLAS and CMS
  • Main R&D topics
    • Adapt to new requirements driven by changing algorithms and data processing patterns, e.g.,
      • The need for fast access to training datasets for Machine Learning
      • Supporting high granularity access to event data
        • Needed to effectively exploit backfill or opportunistic resources
      • Rapid high throughput access for a future analysis facility
      • Processing sites with small amounts of cache storage (a toy LRU-cache sketch follows this list)
    • Do this profiting from the advances in industry standards and implementations, such as Apache Spark-like clusters (area of continued rapid evolution)
    • Consolidate storage access interfaces and protocols
    • Support efficient hierarchical access to data, from high latency tape and medium latency network
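
The "small cache in front of remote storage" scenario above can be sketched in a few lines; the fetch function, object sizes, and the simple least-recently-used eviction are all toy choices, not a description of any production system such as Rucio or XCache.

```python
from collections import OrderedDict

class ToyCache:
    """Bounded least-recently-used cache sitting in front of remote reads."""
    def __init__(self, fetch, capacity_bytes):
        self.fetch = fetch                   # callable: object name -> bytes
        self.capacity = capacity_bytes
        self.items = OrderedDict()           # name -> payload, oldest first

    def get(self, name):
        if name in self.items:
            self.items.move_to_end(name)     # cache hit: mark recently used
            return self.items[name]
        data = self.fetch(name)              # cache miss: remote read
        self.items[name] = data
        while sum(len(v) for v in self.items.values()) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used
        return data

# toy usage: pretend every remote object is 1 MB and the cache holds 3 MB
cache = ToyCache(fetch=lambda name: bytes(1_000_000), capacity_bytes=3_000_000)
for name in ["a", "b", "c", "a", "d"]:       # the second "a" is a cache hit
    cache.get(name)
```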


36 of 46

Facilities and distributed computing

  • Storage and computing today are provided overwhelmingly from WLCG resources
    • Expected to continue for the HL-LHC, but to be strongly influenced by developments in commodity infrastructure as a service (IaaS, i.e. commercial cloud computing)
  • Main R&D topics
    • Understand far better the effective costs involved in delivering computing for HEP
      • This needs to be sensitive to regional variations in funding and direct and indirect costs
        • e.g. smaller sites frequently contribute ‘beyond the pledge’ resources, power costs and human resources
      • A full cost model is infeasible, but a reasonable gradient analysis to guide future investment should be possible (a toy sketch follows this list)
        • Should we invest in better network connectivity or in more storage?
    • How to take better advantage of new network and storage technologies (software defined networks, object stores or content addressable networks)
    • Strengthen links to other big-data sciences (e.g., SKA) and to computer science; how to share network resources
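
The "gradient analysis" idea above can be made concrete with a deliberately toy calculation. Every number and the throughput model below are placeholders; the point is only the shape of the question, i.e. comparing the marginal gain per unit of spend for competing investments.

```python
def events_per_year(network_gbps, cache_tb):
    """Toy throughput model: limited by the tighter of remote-read bandwidth
    and local cache capacity. All coefficients are placeholders."""
    remote_limited = network_gbps * 2.0e6    # toy: events/year per Gb/s of WAN
    cache_limited = cache_tb * 1.5e5         # toy: events/year per TB of cache
    return min(remote_limited, cache_limited)

baseline = events_per_year(network_gbps=100, cache_tb=2000)
# spend one notional budget unit on each option in turn (toy exchange rates)
gain_network = events_per_year(network_gbps=120, cache_tb=2000) - baseline
gain_storage = events_per_year(network_gbps=100, cache_tb=2500) - baseline
print("marginal gain: network", gain_network, "vs storage", gain_storage)
```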


37 of 46

Machine learning

  • Neural networks and Boosted Decision Trees have been used in HEP for a long time
    • E.g. particle identification algorithms
  • More recently the field has been significantly advanced by new techniques (deep neural networks) and improved training methods
    • Very good at dealing with noisy data and huge parameter spaces
    • A lot of interest from our community in these new techniques, in multiple fields
  • Main R&D topics
    • Speeding up computationally intensive pieces of our workflows (fast simulation, tracking)
    • Enhancing physics reach by classifying signal and background better than our current techniques (a toy classifier sketch follows this list)
    • Improving data compression by learning and retaining only salient features
    • Anomaly detection for detector and computing operations
  • Significant efforts will be required to make effective use of these techniques
    • Good links with the broader Machine Learning and Data Science communities required
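
As a hedged illustration of the classification point above, the sketch below trains scikit-learn's gradient-boosted trees on a generated toy signal/background sample and reports a ROC AUC; the features, sample sizes, and the choice of classifier are placeholders, not a statement about what any experiment actually uses.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# toy signal/background sample with two discriminating features
rng = np.random.default_rng(1)
signal = rng.normal(loc=[1.0, 1.0], scale=0.8, size=(5_000, 2))
background = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(5_000, 2))
X = np.vstack([signal, background])
y = np.concatenate([np.ones(5_000), np.zeros(5_000)])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print("toy ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```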


38 of 46

Other technical areas of work

Conditions Data

  • Growth of alignment and calibration data is usually linear in time
    • Per se, this does not represent a major problem for the HL-LHC
  • Opportunities to use modern distributed techniques to solve this problem efficiently and scalably
    • Cacheable blobs accessed via REST (a minimal sketch follows this list)
    • CVMFS + Files
    • Git
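
A minimal sketch of the "cacheable blobs accessed via REST" option above: conditions payloads keyed by tag and interval of validity are immutable once published, so they can be fetched over HTTP and cached indefinitely. The URL scheme and keys below are hypothetical, not any experiment's conditions service.

```python
import requests

_cache = {}  # payloads are immutable once published, so caching them is safe

def get_conditions(tag, iov):
    """Fetch a conditions payload for a (tag, interval-of-validity) key via REST."""
    key = (tag, iov)
    if key not in _cache:
        # hypothetical endpoint; in practice this sits behind caching proxies/CDNs
        url = f"https://conditions.example.org/api/{tag}/{iov}"
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        _cache[key] = response.json()
    return _cache[key]

# usage (hypothetical tag and run number):
# alignment = get_conditions("pixel-alignment-v3", iov=314159)
```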

Visualisation

  • Many software products developed for event visualisation
    • Part of the framework, with full access to event and geometry data
    • Standalone as a lightweight solution
  • New technologies for rendering displays exist, e.g., WebGL from within a browser


  • These areas are both examples of where we can refocus current effort towards common software solutions
  • This should improve quality, economise overall effort and help us to adapt to new circumstances

39 of 46

Data, software and analysis preservation

  • We seem to be doing well compared to other fields
  • Challenge is both to physically preserve bits and to preserve knowledge
    • DPHEP has looked into both
  • Knowledge preservation is very challenging
    • Experiment production workflows vary in significant details
    • Variety of different steps are undertaken at the analysis stage, even within experiments
  • Need a workflow that can capture this complexity
    • Technology developments that can help are, e.g., containers
  • CERN Analysis Preservation Portal forms a good basis for further work
    • Needs to have a low barrier for entry for analysts
    • Can provide an immediate benefit in knowledge transmission within an experiment


40 of 46

Software development, training and careers

  • Experiments have modernised their software development models a lot recently
    • Moving to git and CMake as standard components
    • Using social coding sites (GitLab, GitHub) coupled to Continuous Integration
  • Additional tools would benefit the community
    • Static analysis of code, refactoring code, performance measures
  • Using new tools requires investing in training for the community
    • The more commonality in the tools and techniques, the more training we can share
    • This provides preservation and propagation of knowledge
  • Our environment is becoming more complex; we require input from physicists whose concerns are not primarily in software
    • Sustainability of these contributions is extremely important
  • Recognition of the contribution of our specialists in their careers is extremely important


41 of 46

Second synthesis draft

  • 74 page document
  • 12 sections summarising R&D in a variety of areas for HEP Software and Computing
  • Almost all major domains of HEP Software and Computing are covered
  • Incorporated lots of community feedback on the first draft
  • Feedback on the second (and last) draft is open until December 1st
  • Please sign the paper here


42 of 46

HEP Software Foundation (HSF)

  • The LHC experiments, Belle II and DUNE face the same challenges
    • HEP software must evolve to meet these challenges
    • Need to exploit all the expertise available inside and outside our community for parallelisation
    • Many problems are intrinsically sequential. Need to work on reconstruction algorithms too.
  • Cannot afford any more duplicated efforts
    • Each experiment has its own solution for almost everything (framework, reconstruction algorithms…)
  • The goal of the HSF is to facilitate coordination and common efforts in software and computing across HEP in general


43 of 46

The evolving technology landscape

  • The increase in single-core CPU throughput has stalled
  • Many/multi core systems are the norm
    • Serial or multi-process processing is under severe memory pressure
  • Co-processors now commonplace
    • GPGPUs, FPGAs - greater throughput, far more challenging programming model
  • Wide vector registers (up to 512 bit)
  • Power is a dominant factor
  • Storage capacity keeps climbing
    • 100 TB disks are possible by the HL-LHC, but little I/O improvement is expected (a worked estimate follows this list)
  • Network capacity keeps growing
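
The "little I/O improvement" caveat above has a simple consequence worth spelling out. Assuming, purely for illustration, a sustained sequential throughput of about 250 MB/s (an assumed figure, not a vendor specification), reading a 100 TB disk once takes:

```latex
t = \frac{100\ \mathrm{TB}}{250\ \mathrm{MB/s}}
  = \frac{10^{14}\ \mathrm{B}}{2.5\times 10^{8}\ \mathrm{B/s}}
  = 4\times 10^{5}\ \mathrm{s} \approx 4.6\ \text{days}
```

Under that assumption, filling or re-reading ever-larger disks becomes a bottleneck unless access patterns and data organisation change, which is exactly the concern raised in the data management R&D topics above.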


44 of 46

Software challenges for HL-LHC

  • Pile-up of ~200 ⇒ ~100× the CPU of today
    • Moore’s law over 10 years: only a factor of ~10
  • With a flat budget, Moore’s law is the real maximum we can expect on the hardware side (the resulting factorisation is written out after this list)
  • HEP software typically executes one instruction at a time (per thread)
    • Since ~2013, CPU (core) performance increases have come mainly from more internal parallelism
    • A ×10 gain with the same hardware is achievable only by using the full potential of the processors
      • Major software re-engineering is required (but rewriting everything is not an option)
    • Co-processors like GPUs are of little use until this problem has been solved
  • The increased amount of data requires us to revise and evolve our computing and data management approaches
    • We must be able to feed our applications with data efficiently
  • “HL-LHC salvation” will come from software improvements, not from hardware
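
The gap described above can be written as a single factorisation (using the round numbers quoted on this slide):

```latex
\underbrace{\times 100}_{\text{CPU needed at pile-up }\sim 200}
\;\approx\;
\underbrace{\times 10}_{\text{hardware (Moore's law, flat budget, 10 years)}}
\;\times\;
\underbrace{\times 10}_{\text{software re-engineering}}
```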


45 of 46

Editorial board and roadmap document draft

  • Real progress started after the summer
  • Set more realistic goals
    • Individual WG chapters by end of September
    • Overall draft roadmap paper by end of October
  • 14 working group chapters available for community review*
  • Editorial Board was set up, with the aim of encompassing the breadth of our community
  • First draft released 20 October
  • Second draft of the text was prepared by a small team within the Editorial Board
    • Released 17 November


  • Predrag Buncic (CERN) - ALICE contact
  • Simone Campana (CERN) - ATLAS contact
  • Peter Elmer (Princeton)
  • John Harvey (CERN)
  • Benedikt Hegner (CERN)
  • Frank Gaede (DESY) - Linear Collider contact
  • Maria Girone (CERN Openlab)
  • Roger Jones (Lancaster University) - UK contact
  • Michel Jouvin (LAL Orsay)
  • Rob Kutschke (FNAL) - FNAL experiments contact
  • David Lange (Princeton)
  • Dario Menasce (INFN-Milano) - INFN contact
  • Mark Neubauer (U.Illinois Urbana-Champaign)
  • Eduardo Rodrigues (U. Cincinnati)
  • Stefan Roiser (CERN) - LHCb contact
  • Liz Sexton-Kennedy (FNAL) - CMS contact
  • Mike Sokoloff (U.Cincinnati)
  • Graeme Stewart (CERN, HSF)
  • Jean-Roch Vlimant (Caltech)

*Will place final versions on arXiv

46 of 46

Community white paper - moving forward

  • Community White Paper process has been a success
    • Engaged more than 250 people and produced more than 300 pages of detailed description in many areas
  • Summary roadmap lays out a path forward and identifies the main areas we need to invest in for the future
    • Supporting the HL-LHC Computing TDRs and NSF S2I2 strategic plan
  • The current draft will undergo a final round of refinement and conclude within a month
  • HEP Software Foundation has proved its worth in delivering this CWP Roadmap
    • Achieving a useful community consensus is not an easy process
  • We now need to marshal the R&D efforts in the community, refocusing our current effort and helping to attract new investment in critical areas
    • The challenges are formidable; working together will be the most efficacious way to succeed
    • The HSF will play a vital role in spreading knowledge of new initiatives, encouraging collaboration and monitoring progress
    • Workshops are planned for next year (with the WLCG), along with sessions before CHEP
