1 of 30

Computing and Software Challenges

Graeme A Stewart, CERN EP-SFT

ECFA-EPS Special Session, Ghent 2019-07-13

2 of 30

Acknowledgement

  • Preparing this talk would not have been possible without the many interesting submissions to the EPPSU (European Particle Physics Strategy Update)
  • In particular I drew heavily on talks at the Granada Workshop from Simone Campana, Ian Bird, Roger Jones, Matthias Kasemann, Maria Girone and Brigitte Vachon

Thank you!

Of course, I take responsibility for any mistakes and misunderstandings, and the choice of which work to highlight was mine


3 of 30

LHC and HL-LHC Challenges

  • ALICE and LHCb will have a very large increase in rate for LHC Run-3
    • This puts pressure on both CPU resources and storage
  • Both move to a model of data reduction and software triggers
    • Maximise physics within available resources
  • HL-LHC brings a factor of ~4 increase in instantaneous luminosity for ATLAS and CMS (7.5 × 10³⁴ cm⁻²s⁻¹)
  • Trigger rates of 7.5-10 kHz
    • The challenge is rate × complexity
  • Current plots already represent significant improvements over the estimates in the HSF Community White Paper from 2017


[Plot: CMS storage projections]

4 of 30

Non-LHC Experiments

  • HENP
    • DUNE - foresees 70 PB/year by the mid-2020s
    • FAIR - data volumes comparable to the LHC
    • Belle II - 10 PB/year of RAW data
  • Non-HEP
    • SKA
    • LSST
  • We will not be alone as a science at the exabyte scale
  • This is both a threat and an opportunity


5 of 30

The Scale of HEP Computing

  • WLCG: an international collaboration to distribute and analyse LHC data
    • Born of the need to scale up our computing to the challenge of the LHC
    • Integrates computer centres worldwide
    • Provides resources as a single infrastructure accessible by all LHC physicists
      • LHC is about 95% of total HEP resources
  • 167 sites in 42 countries
  • ~1 million CPU cores (100€ each)
  • ~1 exabyte of storage (10-100€ per TB)
  • >2 million jobs per day
  • 10-100 Gb/s network links


6 of 30

The Scale of HEP Software

  • At least 50 million lines of code
    • Each LHC experiment has about 6M lines
  • Mostly C++, a lot of Python
  • This would cost at least €500M to develop commercially
  • A lot of significant common software
    • Event Generators
    • Detector Simulation
    • ROOT, foundational toolkit and analysis framework
  • A lot of experiment specific software
    • Even when a common solution would have been credible


7 of 30

Technology Evolution


8 of 30

Technology Evolution

  • Moore’s Law continues to deliver increases in transistor density
    • But, doubling time is lengthening
  • Clock speed scaling failed around 2006
    • No longer possible to ramp the clock speed as process size shrinks
    • Leakage currents become an important source of power consumption
  • So we are basically stuck at ~3 GHz clocks by the underlying power-density (W/m²) limit
    • This is the Power Wall
    • Limits the capabilities of serial processing
  • Memory access times are now ~100s of clock cycles
    • Poor data layouts are catastrophic for software performance
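
For orientation, the standard CMOS power model (an illustrative addition, not from the slides) makes the power wall explicit: dynamic power scales linearly with clock frequency and quadratically with supply voltage, so once voltage scaling stalled, frequency had to stall too to stay within a coolable power density.

```latex
% Approximate CMOS power model (illustrative):
% alpha = activity factor, C = switched capacitance,
% V_dd = supply voltage, f = clock frequency
P_{\mathrm{total}} \;\approx\;
  \underbrace{\alpha\, C\, V_{dd}^{2}\, f}_{\text{dynamic switching}}
  \;+\;
  \underbrace{V_{dd}\, I_{\mathrm{leak}}}_{\text{static leakage}}
```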


[Image: NVIDIA Titan V GPU - US$3000, ~1.5 GHz]

9 of 30

Decreasing Returns over Time

  • The conclusion is that the diversity of new architectures will only grow
    • We don’t know, specifically, what processors will look like in a decade
  • The best-known example is GPUs
  • But FPGAs and TPUs (Tensor Processing Units) are also used


10 of 30

Disk, Tape, Network

  • Tape market now dominated by a single manufacturer
    • No serious technological obstacles
    • Non-tape archival storage options are not competitive right now
  • Hard disk sizes do still grow
    • ~100 TB drives by HL-LHC
    • But the time to read a disk’s worth of data increases (see the estimate after this list)
  • Network technology keeps improving
    • Foresee continued increases in available bandwidth and increasing capabilities (e.g. software-defined networking, SDN)
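
A rough illustration of the read-time point (the ~250 MB/s sustained throughput assumed here is typical for a hard disk, but it is not a figure from the talk):

```latex
% Time to read a full ~100 TB drive at ~250 MB/s sustained throughput
t \;\approx\; \frac{100\ \mathrm{TB}}{250\ \mathrm{MB/s}}
  \;=\; 4\times10^{5}\ \mathrm{s} \;\approx\; 4.6\ \mathrm{days}
```

Capacity grows much faster than per-drive bandwidth, so draining, refilling or rebuilding storage takes ever longer.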


11 of 30

Hardware Evolution in a Nutshell

  • More complex future
  • Rising uncertainties over technology and prices


[Comparison figures: c. 2000 vs c. 2019]

12 of 30

Challenges and Opportunities


13 of 30

Concurrency and Heterogeneity

  • The one overriding characteristic of modern processor hardware is concurrency
    • Doing more than one thing at a time (SIMD, a.k.a. Vectorisation; MIMD, a.k.a. multi-threading)
  • Because of the inherently parallel nature of HEP processing, a lot of concurrency can be exploited at coarse granularity
    • Task and job parallelism served us well for many years
  • However, the push to highly parallel processing (1000s of GPU cores) requires parallel algorithms
    • This often requires completely rethinking problems that had sequential solutions previously
  • There are a lot of possible parallel architectures on the market
    • Different CPU and GPU variants, no real common API to access them
    • To avoid lock-in we need to use a wrapper (isolating the main algorithm) or a low-level library (see the sketch after this list)
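
As a minimal sketch of the kind of concurrency being described (illustrative C++17, not code from any experiment; the Hit type and the radius kernel are hypothetical): a standard parallel algorithm lets the same per-hit work run multi-threaded and vectorised, with the parallel backend hidden behind a standard interface rather than a vendor-specific API.

```cpp
// Minimal sketch: independent per-hit work expressed as a parallel algorithm.
// The execution policy (par_unseq) requests both multi-threading (MIMD)
// and vectorisation (SIMD) from the standard library backend.
#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

struct Hit { float x, y, z; };   // hypothetical type, illustration only

inline float radius(const Hit& h) {
    // Independent per-element kernel: no shared state, safe to run concurrently
    return std::sqrt(h.x * h.x + h.y * h.y + h.z * h.z);
}

std::vector<float> computeRadii(const std::vector<Hit>& hits) {
    std::vector<float> r(hits.size());
    std::transform(std::execution::par_unseq,
                   hits.begin(), hits.end(), r.begin(), radius);
    return r;
}
```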


14 of 30

Data Layout and Throughput

  • Original HEP C++ Event Data Models were heavily inspired by the Object Oriented paradigm
    • Deep levels of inheritance, access to data through various indirections
    • Scattered objects in memory (see the layout sketch after this list)
  • Lacklustre performance was more or less hidden by the CPU and we survived LHC start-up
  • In-memory data layout has been improved since then (e.g. ATLAS xAOD)
    • But still hard for the compiler to really figure out what’s going on
    • Function calls non-optimal
    • Extensive use of ‘internal’ EDMs in particular areas, e.g. tracking
  • iLCSoft / LCIO also proved that common data models help a lot with common software development
  • Want to be flexible re. device transfers and offer different persistency options
    • e.g. the ALICE Run 3 EDM for message passing and the code-generation approach of the FCC-hh PODIO EDM generator
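
A minimal sketch of the layout point (illustrative only, not any experiment's actual EDM): an "array of structs" interleaves every quantity in memory, while a "struct of arrays" keeps each quantity contiguous, which is what caches, vector units and device transfers prefer.

```cpp
#include <vector>

// Array-of-structs (AoS): the classic OO-style event data model.
// A loop over a single quantity strides over whole objects in memory.
struct ParticleAoS { double px, py, pz, e; int charge; };
using EventAoS = std::vector<ParticleAoS>;

// Struct-of-arrays (SoA): each quantity is contiguous in memory,
// cache- and SIMD-friendly, and simple to copy to an accelerator.
struct EventSoA {
    std::vector<double> px, py, pz, e;
    std::vector<int>    charge;
};

// Summing px over the SoA layout streams through one contiguous array;
// the equivalent loop over EventAoS touches interleaved, mostly unused data.
double sumPx(const EventSoA& ev) {
    double sum = 0.0;
    for (double v : ev.px) sum += v;
    return sum;
}
```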


15 of 30

Machine Learning

  • Machine learning, or artificial intelligence, has been used for many years in HEP
  • Significant advances in ‘deep learning’ in recent years
  • Rapid development driven by industry
    • Vibrant ecosystem of tools and techniques
    • Highly optimised for modern, specialised hardware
  • For HEP offers
    • Better discrimination, already used widely
    • Replace slow calculations with trained outputs
      • In extreme cases skip entire processing steps
  • Challenge to fully exploit these techniques and to integrate them into workflows


Machine learning at the energy and intensity frontiers of particle physics, https://doi.org/10.1038/s41586-018-0361-2

Use of Generative Adversarial Networks to simulate calorimeter showers, trained on G4 events (S. Vallacorsa)

16 of 30

Facilities

  • 25% of compute used by LHC experiments already comes from non-grid resources
    • Cloud Computing
    • HPC Centres
    • HLT Farms
  • These resources will likely become more important in the future
  • Exascale HPCs planned around compute accelerators
  • Key challenge is their efficient use
    • How to utilise their GPUs
    • End to end problem to optimise total throughput
    • Overcome access peculiarities per site


[Plot: ATLAS CPU usage, 2018]

US, EU, Japan and China are all planning for exascale machines

17 of 30

HEP Evolution and R&D


18 of 30

Storage and Data Management

  • Storage of HEP data is the main challenge in the next decade
    • Data is our main asset, and our main cost
    • No opportunistic storage
    • Petabyte level storage facilities are hard to operate
  • We have massive experience in this area
  • Active R&D into Data Organisation, Management and Access (DOMA)
    • Modernised network protocols
    • Use caches to hide latency, support CPU only sites
    • Data carousels to increase tape use with scheduled access
    • Quality of storage interfaces


Horizon 2020 funding of exabyte scale science infrastructure

[Diagram: data cloud model]

19 of 30

Future Shared Infrastructure

  • There is an opportunity to leverage commonality across HEP and beyond.
    • This is happening already - compromise between experiment specific and common solutions
    • Sustainability is very important
  • Most of the facilities supporting HEP and other science projects are the same.
    • The Funding Agencies do not want to deploy several computing infrastructures
  • The idea of generalising the infrastructure-related aspects of WLCG and opening them to more scientific communities is well received
    • Prototyped with DUNE and Belle II


[Image: CernVM File System Users Workshop]

20 of 30

Event Generation

  • Starting the simulated events chain from theory
    • Previously a very small part of the LHC computing budget (cf. detector simulation), so there was no pressure to optimise
  • Increasing use of higher precision to drive down errors (NLO, NNLO, …); negative weights are a serious problem (see the estimate after this list)
    • Greatly increases the CPU budget fraction given over to event generation
    • Possibility of sharing matrix element calculations between experiments being explored (HSF WG coordinating)
  • Theory community not rewarded for providing generators to experiments
    • Lack of expertise and incentives to adapt to modern CPU architectures
  • From the technical point of view, these codes are a good target for optimisation
    • Might even be suitable for GPUs
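
To make the cost of negative weights concrete (a standard statistics estimate, not taken from the talk): for unit-magnitude weights with a fraction f of negative-weight events, the effective size of a sample of N generated events is

```latex
N_{\mathrm{eff}} \;=\; \frac{\left(\sum_i w_i\right)^{2}}{\sum_i w_i^{2}}
\;\overset{w_i = \pm 1}{=}\; N\,(1-2f)^{2}
```

so a negative-weight fraction of 25% already costs a factor of four in generated events (and hence generator CPU) for the same statistical power.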


[Plot: ATLAS 2018 CPU report]

21 of 30

Simulation

  • A major consumer of LHC grid resources today
    • Experiments with higher data rates will need more simulation
  • At the same time flat budget scenarios don’t give a lot more cycles
    • So need faster simulation
  • Technical improvement programme helps (and helps everyone)
    • GeantV R&D modernises code and introduces vectorisation; serious studies of GPU porting are starting (US Exascale Computing Project), but the problem is seriously hard
  • Even this will probably not be sufficient to meet future needs
    • Will need to trade off accuracy for speed with approximate and hybrid simulation approaches
      • Combine full particle transport with faster techniques for non-core pieces of the event (see the sketch after this list)
  • Machine learning techniques are gaining ground, but are yet to be really proven
    • Need to decide when they are good enough compared to Geant4
    • Integrating these into the lifecycle of simulation software and developing toolkits for training and inference is needed - this is a software and a computing problem
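
A hedged sketch of what a hybrid approach could look like at the dispatch level (purely illustrative; the types, function names and cuts below are hypothetical, not Geant4 or any experiment's API):

```cpp
#include <cmath>

// Hypothetical types and stand-in functions, for illustration only.
struct Particle { double eta; double energy; };
struct SimHits  { int nHits = 0; };

SimHits fullTransport(const Particle&)        { return {1000}; }  // stand-in for detailed (Geant4-style) transport
SimHits fastParameterisation(const Particle&) { return {100};  }  // stand-in for a parametric or trained shower model

// Hybrid strategy: spend full simulation only where the physics demands it
// (e.g. central, high-energy particles) and use the fast model elsewhere.
SimHits simulate(const Particle& p) {
    const bool needsFullSim = std::abs(p.eta) < 2.5 && p.energy > 10.0;  // illustrative cuts
    return needsFullSim ? fullTransport(p) : fastParameterisation(p);
}
```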


22 of 30

Reconstruction and Software Triggers

  • Hardware triggers no longer sufficient for modern experiments (LHCb, ALICE)
    • More and more initial reconstruction needs to happen in software
  • Close to the machine, need to deal with tremendous rates and get sufficient discrimination
    • Pressure to break with legacy code is high
    • Lots of work in rewriting code for GPUs
  • Best practice essential - data layout optimised, concurrent, async
  • Even the physics performance can improve when revisiting code
  • Real Time Analysis (HEP Version)
    • Design a system that can produce outputs useful for analysis as part of the trigger decision
    • If this captures the most useful information from the event, can dispense with raw information
  • This is a way to fit more physics into the budget


LHCb Run2 Turbo took 25% of events for only 10% of bandwidth

23 of 30

Analysis

  • Scaling for analysis-level data is also a huge challenge
  • Efficient use of analysis data can come from combining many analyses as carriages in a train-like model (pioneered by PHENIX and ALICE)
    • Also goes well with techniques like tape carousels
  • Reducing volume of data needed also helps hugely
    • CMS ~1 kB/event NanoAOD - a vast difference to analysis efficiency and “papers per petabyte”
  • Improve analysis ergonomics - how the user interacts
    • Declarative models (ROOT’s RDataFrame)
      • Say what, not how, and let the backend optimise (see the example after this list)
    • Containers gain ground; notebook interfaces used for training and may scale further
    • Cluster power, laptop convenience - analysis clusters (interactive ROOT on HPCs)
  • Interest in data science tools and machine learning is significant for this community - inspiring new approaches (e.g. Coffea)
    • This is an ecosystem into which HEP can, and does, contribute - knowledge transfer goes both ways
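
A minimal example of the declarative style with ROOT's RDataFrame (schematic: the tree, file and branch names are placeholders): you declare the selection and the result you want, and the backend decides how to run the event loop, including implicit multi-threading and reading only the branches that are used.

```cpp
#include <ROOT/RDataFrame.hxx>
#include <TROOT.h>

void analyse() {
    ROOT::EnableImplicitMT();                       // let the backend parallelise the event loop
    ROOT::RDataFrame df("Events", "nanoaod.root");  // placeholder tree / file names

    // Declarative chain: state *what* is wanted, not *how* to loop over events.
    auto h = df.Filter("nMuon == 2", "two muons")   // selection expressed on (placeholder) branches
               .Histo1D({"h", "Dimuon mass;m [GeV];Events", 100, 0., 200.},
                        "Dimuon_mass");             // placeholder column name

    h->Draw();  // the single optimised event loop runs lazily, on first use of the result
}
```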


24 of 30

Facing the Challenges


25 of 30

Training and Careers

  • Many new skills are needed for today’s software developers and users
  • The base level of training has relatively uniform demands
    • Any common components help us
  • LHCb StarterKit initiative taken up by several experiments, sharing training material
    • Links to ‘Carpentries’ being remade (US funded training projects)
  • New areas of challenge
    • Concurrency, accelerators, data science
    • Need to foster new C++ expertise (unlikely to be replaced soon as our core language, but needs to be modernised)
  • Careers for HEP software experts are an area of great concern
    • Need a functioning career path that retains skills and rewards passing them on
    • Recognition that software is a key part of HEP now


26 of 30

Organising for the Future

  • HSF (HEP Software Foundation)
    • Overarching umbrella organisation, at the international level (strongest in Europe and North America)
    • Builds community efforts, very inclusive; defined the Community White Paper Roadmap
  • Software Institutes
    • IRIS-HEP in US
      • NSF funded at US$25M over 5 years
      • Machine Learning, DOMA, Innovative Advanced Algorithms, Analysis
    • Should Europe do more here?
      • Traditionally labs (CERN, DESY) have played this role, but time to break out beyond HEP?
      • A lot of shared problems - critical architecture changes, new techniques affect us all
        • Value of the institute is in breaking boundaries (experiment, region, science)
      • Linking to academic experts in software engineering could be mutually very beneficial
      • Also helps us to tackle the training problem (pass on skills) and careers (better defined path) and solve practical software problems


27 of 30

Summary

  • The landscape has shifted significantly in the last decade
    • Concurrency, Accelerators, High-Speed Networks, Exascale, …
  • We are constantly adapting and evolving our software and computing
    • Challenges are not just for current experiments, but also R&D for future detectors
  • Adopting a more radical approach involves committing a lot of human effort
    • It really pays off - improved software improves our physics
      • Poor and underfunded software costs resources or cuts into physics
    • Efficient use of heterogeneous resources needs a critical mass of software
  • Pyramid of skills and expertise
    • Need a lot of software engineering and physics talent
    • Address training needs
    • Long term career prospects for HEP software experts need to improve
  • There are huge opportunities to improve our software that we have to grasp
    • Organise around this goal - continue to reach out to industry, software engineers, other sciences


28 of 30

Backup


29 of 30

Optimal Software - The Golden Rules

  • Orienting the design around the data (with optimal layouts) is critical
  • Bulk data together and exploit concurrency wherever possible
  • Be as asynchronous as possible (see the sketch after this list)
    • Framework should hide latency
    • Storage systems should help
  • Transfers between host and device are expensive
    • Port blocks of algorithms, even ones where the gain is small, so that data can stay on the device
  • The physics performance can improve when revisiting code!
    • We have a lot of legacy code; revisiting it with the primary goal in mind simplifies it and improves maintainability
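
A minimal sketch of the asynchrony point (illustrative only, no specific framework implied; the calibration fetch and local work are stand-ins): launch the slow operation as a future and keep the CPU busy until the result is actually needed.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-ins, for illustration only.
std::vector<float> fetchCalibration() { return std::vector<float>(1000, 1.f); }  // pretend: slow remote read or device copy
double localWork() { return 42.0; }                                              // independent CPU-side work

double processEvent() {
    // Launch the slow operation asynchronously...
    auto calib = std::async(std::launch::async, fetchCalibration);

    // ...and hide its latency behind work that does not depend on it.
    const double partial = localWork();

    // Block only at the point where the result is really required.
    const std::vector<float> c = calib.get();
    return partial + std::accumulate(c.begin(), c.end(), 0.0);
}
```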


30 of 30

Summary of EPPSU Inputs

  • The EPPSU inputs that made mention of software and computing are summarised here:

https://docs.google.com/spreadsheets/d/1mjN6AaSUUFY-r_HxkKvV4E4f2cgPkEaLchEFIHm0LxA/edit?usp=sharing
