1 of 35

Sustainable Computing in an Era of Rising Hardware Costs and Slowing Per-Core Progress

Luca Atzori, Ian Fisk, Maria Girone

2 of 35

Sustainable

“Sustainable”

  • Normally means discussions of how to sustain everything (reducing carbon, power, etc.). I’m going to include sustaining the mission

Money

  • Computing budgets are assumed to be flat

Time

  • We need to be able to calculate things in reasonable lengths of time

Technology

  • We need the right technology to solve the challenges we face

3 of 35

Outline

Computing Evolution

  • Computing processing and storage capacity and performance advances
  • Following long trends and driven by technology improvements

Replacement Models

  • Motivation for the models we have

Cost Evolution

  • Historically computing costs have remained stable or slowly decreased
  • Not anymore

Impact

  • How does this change how/when we replace computers?
  • How does it change what resources we use?
  • How does it change our expectations for how many resources we have?

4 of 35

Computing Evolution

5 of 35

Dennard Scaling

Computer clock speed rose exponentially from 1970 to 2005

  • In a 15-year period the speed of computers could increase by a factor of 100

In the last 20 years, computer clock speeds have increased by a factor of 2-3

  • Modern server systems are now ~4GHz
    • Close to the 6-10GHz max possible with silicon technology and the limits of the speed of light

Performance of individual cores still increases through wider instruction sets and more calculations per clock cycle

  • Linear and not exponential improvements
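As a rough check of the two eras above, the quoted factors can be converted into compound annual rates. This is only a sketch: `annual_growth` is an illustrative helper that is not part of the talk, and it assumes smooth compounding over each period.

```python
def annual_growth(factor: float, years: float) -> float:
    """Compound annual growth rate implied by an overall factor over a period."""
    return factor ** (1.0 / years) - 1.0


# Factors quoted on the slide
dennard_era = annual_growth(100, 15)  # factor of 100 in a 15-year period
recent_low = annual_growth(2, 20)     # factor of 2 in the last 20 years
recent_high = annual_growth(3, 20)    # factor of 3 in the last 20 years

print(f"Dennard era:   {dennard_era:.1%} per year")
print(f"Last 20 years: {recent_low:.1%} to {recent_high:.1%} per year")
```

Under these assumptions the earlier era works out to roughly 36% a year, against roughly 4-6% a year for clock speed over the last two decades.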

6 of 35

Multicore

With clock speeds stalled but transistor density and die size still increasing, multi-core CPUs became standard

  • Computing capacity increases with each additional core
  • 256-core CPUs will be available next year
    • Memory bandwidth per core has largely been maintained
      • 12-16 DIMM channels per socket
    • Memory per core can also be maintained with increasing DIMM size
    • IO per system also scales as 400Gb/s networks per system are available

This pushed workloads toward many processes per system or highly parallel code

7 of 35

Watts

Increasing the density of silicon reduces the watts per core as the feature size decreases and the efficiency improves

  • These improvements are flattening

A modern CPU socket is 400W-500W

  • Many layers of silicon drive the need for even more cooling

The amount of air a fan can move goes as the square of its radius

  • A fan in a 1U server has to spin about 4 times as fast as a fan in a 2U server to move the same air
  • A 500W CPU might have 250W of power used blowing air over it

Drives toward direct liquid cooling

8 of 35

Enter the GPU

Titan, in 2012, was one of the first GPU-accelerated supercomputers to reach the top of the Top500

  • Built by Oak Ridge
  • 16k K20 NVIDIA GPUs
  • 8MW
  • Number 1 on Top500
  • 17PFlop/s

GPUs were originally designed to apply rotation matrices to objects for graphics rendering

  • A K20 had 2500 CUDA cores to parallelize matrix multiplication (a B200 has 18k cores)
  • It is a linear algebra accelerator

9 of 35

GPU Performance per Watt

The 2013 Green500 list had a GPU-accelerated system at the top

  • 3.2GFlops/W for NVIDIA K20
  • This beats a Xeon Phi by 30%

In the 2025 list, the top 100 slots are all GPU accelerated

  • 73GFlops/W for Grace-Hopper
  • The top CPU-based system is Fujitsu at 16GFlops/W
  • The top x86-based system is a 192-core AMD at 13GFlops/W

[Figure panels: 2013, today]

Manufacturers have concentrated on performance

10 of 35

GPU Advances

The FP32 performance of a modern GPU has increased at 25% a year for the last decade

  • Outpacing CPU improvements

11 of 35

GPU Leveling

AI can benefit from low precision calculations

  • There is a limited amount of space on the silicon

12 of 35

GPU and CPU Watts

Modern processing devices use a lot of power

  • 500W CPUs
  • 1000W GPUs with 2000W GPUs on the roadmap

They are effectively impossible to cool with air

  • The heat capacity of water is roughly 4000 times that of air per unit volume
  • Air flow is proportional to the area of the fan (r²), so small fans have to spin much faster, which uses a lot of power
  • Fans are moving parts and wear out

Large scale HPC installations and AI clusters are all moving to DLC

13 of 35

Status

Evolution of system improvement is slowing

  • 15% annually for CPUs
  • 25% annually for GPUs
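To put these rates next to the 18-month doubling that the replacement models were built on, the sketch below converts between an annual improvement rate and a doubling time. Both helper names are illustrative and not from the talk, and the calculation assumes a constant compound rate.

```python
import math


def doubling_time_years(annual_rate: float) -> float:
    """Years needed to double at a fixed compound annual improvement rate."""
    return math.log(2) / math.log(1 + annual_rate)


def annual_rate_for_doubling(months: float) -> float:
    """Annual improvement rate equivalent to doubling every `months` months."""
    return 2 ** (12 / months) - 1


for label, rate in [("CPU", 0.15), ("GPU", 0.25)]:
    years = doubling_time_years(rate)
    print(f"{label}: {rate:.0%} per year doubles in about {years:.1f} years")

print(f"Doubling every 18 months is about {annual_rate_for_doubling(18):.0%} per year")
```

At a constant rate, 15% a year doubles in about 5 years and 25% a year in about 3 years, compared with 1.5 years historically.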

Efficiency in terms of Flops/W has improved, but the rate of improvement has also slowed

  • GPUs are roughly 5 times as efficient as CPUs
  • The focus of manufacturers has been on performance
    • We have seen dramatic increases in wattage

Our economic models are based on costs remaining constant or slowly decreasing

  • When to replace
  • What technology to use

14 of 35

Replacement Models

15 of 35

Replacement Models CPUs

How did we get to the 5-6 year replacement cycle for CPUs?

Historical

  • When computer performance doubles every 18 months, after 5 years the old gear has 10% of the performance of the new

Practical

  • 5 years is about the max a company will sell a support contract for
  • Air cooled systems have a lot of moving parts

Operations

  • If a compute node is 800W, it uses ~7MWh (7000kWh) a year
    • At 25 centimes a kWh that would be 1.75k CHF a year
    • At a PUE of 1.4, the cooling would be another 700 CHF a year
  • At 10k CHF a node, operations cost as much as the node by year 4
  • Even at 15% improvement per year, after 5 years a new system is twice as powerful (the old one is at 50%) with less energy per core
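The operations arithmetic above can be reproduced in a few lines. This is a sketch under stated assumptions: the node runs at full power all year, the 0.4 quoted for cooling is read as the overhead on top of the IT load (a PUE of 1.4), and the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years and downtime


def yearly_operating_cost(node_watts, chf_per_kwh, pue):
    """Return (energy in kWh, electricity cost, cooling overhead cost) per year."""
    energy_kwh = node_watts / 1000 * HOURS_PER_YEAR
    electricity = energy_kwh * chf_per_kwh
    # Assumption: everything above a PUE of 1.0 is treated as cooling overhead.
    overhead = electricity * (pue - 1.0)
    return energy_kwh, electricity, overhead


energy_kwh, electricity, overhead = yearly_operating_cost(800, 0.25, 1.4)
yearly_total = electricity + overhead
years_to_match_purchase = 10_000 / yearly_total  # 10k CHF node from the slide

# Relative performance of new gear after 5 years at 15% improvement per year
new_vs_old = 1.15 ** 5

print(f"Energy:      {energy_kwh:.0f} kWh per year")
print(f"Electricity: {electricity:.0f} CHF per year")
print(f"Cooling:     {overhead:.0f} CHF per year")
print(f"Operations match the purchase price after {years_to_match_purchase:.1f} years")
print(f"New system vs 5-year-old system: {new_vs_old:.2f}x")
```

With these inputs the node uses about 7000 kWh a year, operations reach the 10k purchase price after about four years, and five years of 15% improvement is a factor of about 2.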

16 of 35

Replacement Models GPUs

GPUs are typically on a 3-4 year replacement cycle

Historical

  • GPU performance has doubled every 18 months and GPU memory has increased at a similar rate
  • Older gear is not only slower, it may be unsuitable

Practical

  • 39 months is the maximum warranty NVIDIA currently offers

17 of 35

Cost Evolution

18 of 35

Current Investments in AI

US industry is expected to invest $500B-$700B in data centers, AI Facilities, and research this year

  • This investment is concentrated in the top hyperscalers
  • NVIDIA announced they have ~$1T in orders for the next 12 months
    • Includes worldwide sales

The US government will invest roughly $3B-$5B in AI research

It’s not just that we aren’t driving anymore, we’re barely influencing

19 of 35

Impact Memory and Storage

GPUs need memory to store large and complex models

  • To make effective use of those models tremendous amounts of RAM and fast storage are needed
  • As the focus has shifted from training to profitably using the models for inference, the demands for this storage have increased dramatically

AI hyperscale installations are buying all the memory and fast storage

  • Exacerbated by the focus on HBM for GPUs

20 of 35

Impact Memory

Micron makes memory and their stock is up nearly 700% in the last year

  • Memory prices have increased by roughly a factor of 5 in the last 6 months

It is not possible to buy 128GB MR DIMMs in 2026

  • All production is spoken for
  • 32GB MR DIMMs will be end of life this summer as production focuses on more expensive large devices

All the Flatiron purchase orders from the fall were cancelled by the manufacturer because they would lose too much money

  • Our order placed in September was $6.5M
  • In February it was repriced at $13M
  • 3 weeks later it was priced at $19M

21 of 35

A tale of two drives

The hard disk

  • The venerable tool of HEP computing
  • Capacity slowly increasing with changes in technology and number of platters
  • Assembled, with production capacity limited by labor and demand
  • This year cost has increased by about 10% and capacity by 15%

The solid-state drive

  • Capacity rapidly increasing, driven by increases in density and number of layers
  • Production limited by demands on fabrication facilities
  • Cost has increased by a factor of 3-5 since January

22 of 35

Costs

  • Recently, I received a quote for 200PB of SSD space
    • $55M in December 2025
    • $96M in January of 2026

  • Similar percentage increase in the components we use for processing and service nodes
    • Roughly a factor of 2 increase for the same size

  • Increases are worse at the high end of the scale

23 of 35

HPC/HTC Computing

8-way NVIDIA HGX Node

  • $350k (had been $250k last year)
    • Increase of 40%
  • 1 set of RAM
  • Flatiron has 36 of these

HPC/HTC Node

  • $95k (had been $32k last year)
    • Increase of ~200% (a factor of 3)
  • 1 set of RAM
  • Flatiron wanted to buy 200 of these

It’s easier and more affordable to buy AI optimized hardware than HPC/HTC
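The two price jumps are easy to mix up between "increase of" and "factor of", so a short check may help. The sketch uses only the prices quoted on this slide; `percent_increase` is an illustrative helper, and the 200-node order total is simply the per-node price multiplied out.

```python
def percent_increase(old: float, new: float) -> float:
    """Percentage increase from an old price to a new price."""
    return (new - old) / old * 100


# (price last year, price now) in USD, as quoted on the slide
nodes = {
    "8-way NVIDIA HGX node": (250_000, 350_000),
    "HPC/HTC node": (32_000, 95_000),
}

for name, (old, new) in nodes.items():
    print(f"{name}: +{percent_increase(old, new):.0f}% (factor of {new / old:.1f})")

# Cost of the 200 HPC/HTC nodes at last year's and this year's price
order_old = 200 * 32_000
order_new = 200 * 95_000
print(f"200 HPC/HTC nodes: ${order_old / 1e6:.1f}M then, ${order_new / 1e6:.1f}M now")
```

The 200-node totals land in the same range as the repriced order quoted earlier, although the talk does not say the two are the same purchase.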

24 of 35

GPU Cost Calculus

Previously

  • The GPU server was ~10 times the cost to buy and 10 times the power to operate of a CPU-only machine
  • CPU servers were much more common and general purpose
  • Unless your application was 10-20 times faster on the GPU it didn’t make economic sense

Now

  • ~4 times the cost to buy
  • CPU servers are still more common and general purpose but have much lower investment from industry
  • Now a factor of 3-5 improvement is sufficient
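One simple way to frame this calculus is to ask what speedup makes a GPU server cost the same per unit of work as a CPU server over its lifetime. The model below is a sketch and not the speaker's method: it counts only purchase price and power, and `opex_share` (the fraction of the CPU server's lifetime cost that goes to operations) and the current power ratio are assumptions not given on the slide.

```python
def break_even_speedup(purchase_ratio, power_ratio, opex_share):
    """Speedup at which a GPU server matches a CPU server on cost per unit of work.

    purchase_ratio: GPU purchase price / CPU purchase price
    power_ratio:    GPU power draw / CPU power draw
    opex_share:     fraction of the CPU server's lifetime cost spent on operations
    """
    if not 0.0 <= opex_share <= 1.0:
        raise ValueError("opex_share must be between 0 and 1")
    # Lifetime cost ratio: purchase and operations weighted by their CPU-side share
    return purchase_ratio * (1 - opex_share) + power_ratio * opex_share


# Previously: ~10 times the purchase cost and 10 times the power
previously = break_even_speedup(10, 10, opex_share=0.5)

# Now: ~4 times the purchase cost.
# Assumption: the slide gives no current power ratio, so 4 is a placeholder.
now = break_even_speedup(4, 4, opex_share=0.5)

print(f"Break-even speedup previously: {previously:.1f}")
print(f"Break-even speedup now:        {now:.1f}")
```

When the purchase and power ratios are equal, the break-even speedup is that ratio whatever the operations share, which is why the threshold drops from about 10 to about 4 in this simplified picture.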

Changing memory costs changes the economics of what to use

25 of 35

Possible Evolution

It is very hard to buy computers this year

  • The shutdown for HL-LHC is well timed

What will it look like moving forward?

The AI bubble completely bursts

  • A lot of companies go out of business
  • We buy computing capacity at a steep discount

The AI investment accelerates

  • Components get even more expensive

It stops getting worse

  • They build new fabs
  • This becomes the new pricing

26 of 35

Impact

27 of 35

New Status

  • The computers that have sustained the program for 3 decades have become twice as expensive in the last six months
    • Driven by components also needed for AI
      • HEP needs ~3GB/core

  • The processors we depend on are evolving the slowest
    • Worse efficiency
    • Smaller investments
    • Slower evolution

28 of 35

Impact: Resources

  • The experiments already have an aggressive R&D program to try to fit into 10% and 20% evolution curves
    • If the computers are twice as expensive, then the lower line is the maximum one would expect, even with 20% evolution
    • We would not have enough resources without more money, or a change of direction
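A sketch of why a doubling in price pushes even the optimistic curve down to the pessimistic one. It assumes a flat budget and reads "evolution" as the yearly gain in capacity per unit of money; it tracks what one year's budget can buy, not the installed total, and the function name is illustrative.

```python
def purchasable_capacity(years, annual_improvement, cost_factor=1.0):
    """Capacity a flat yearly budget buys, relative to year 0 at the old prices."""
    return [(1 + annual_improvement) ** y / cost_factor for y in range(years + 1)]


slow = purchasable_capacity(10, 0.10)                # 10% curve at the old prices
fast_doubled = purchasable_capacity(10, 0.20, 2.0)   # 20% curve with prices doubled

# First year in which the doubled-price 20% curve overtakes the 10% curve
crossover = next(y for y, (f, s) in enumerate(zip(fast_doubled, slow)) if f > s)

for year, (f, s) in enumerate(zip(fast_doubled, slow)):
    print(f"year {year:2d}: 10% curve {s:.2f}   20% curve at double price {f:.2f}")
print(f"The doubled-price 20% curve stays below the 10% curve until year {crossover}")
```

In this simplified model the 20% curve at doubled prices does not catch up with the old 10% curve for about eight years.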

29 of 35

Impact: Replacement Models

With a significant increase in cost and a general slowing of improvement the model for replacement needs to be revisited

8-10 year cycles for CPU based systems

  • Need to understand how to operate and maintain systems for longer
    • Migration to direct liquid cooling
      • Fewer moving parts, higher operational efficiency, lower failure rates due to more consistent temperatures
  • User expectations on what is considered an old system
  • Increase hosting capacity because we will be running a larger mix of old and new systems
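The hosting point can be illustrated with a steady-state fleet in which the same number of nodes is bought every year and each year's nodes are 15% more capable than the previous year's. This is a sketch under those assumptions (which the current price swings clearly strain), with an illustrative function name.

```python
def fleet_capacity(lifetime_years: int, annual_improvement: float) -> float:
    """Installed capacity in units of one newest-generation yearly purchase."""
    # A batch that is `age` years old delivers (1 + r) ** -age of a new batch.
    return sum((1 + annual_improvement) ** -age for age in range(lifetime_years))


short_cycle = fleet_capacity(5, 0.15)    # 5 generations on the floor
long_cycle = fleet_capacity(10, 0.15)    # 10 generations on the floor

print(f"5-year cycle:  5 batches hosted, capacity {short_cycle:.2f}")
print(f"10-year cycle: 10 batches hosted, capacity {long_cycle:.2f}")
print(f"Twice the hosted nodes gives {long_cycle / short_cycle:.2f}x the capacity")
```

Under these assumptions, doubling the lifetime doubles the number of nodes to host and power while raising installed capacity by about half.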

30 of 35

Transitions

I am reminded of a previous time when CERN was dependent on an old and slowly evolving paradigm, which was getting expensive

  • Research projects between industry and science like CERN openlab helped facilitate the transition

31 of 35

Impact: Flexibility

  • We could ask for more money
    • Asking for money because you don’t want to evolve is not historically a winning strategy

  • We need to be more flexible on the hardware architectures
    • Access to more efficient and more rapidly evolving GPUs
    • Better alignment with industry investment
    • Ability to access HPC facilities and shared computing resources

  • Need broader adoption of portability libraries
    • We need to care less about the underlying hardware

  • We need to understand how many of our applications can be cast as AI applications
    • This is where the investment is

32 of 35

33 of 35

Impact

The calculus for when to replace systems needs to be completely redone

  • Current model assumes gear is more capable for the same amount of money
  • In the current environment of rapidly rising costs with small performance increases, our old machines will be valuable for longer

Prices are changing rapidly. In the US, vendor quotes are now valid for 2 weeks and no one will commit to a price until the order ships

  • Our whole bidding and request for proposals model no longer works

Our estimates for what hardware costs need to be revised

The cost benefit of GPUs needs to be recalculated

We are unlikely to be able to grow as much as we would like.

34 of 35

Outlook (1/2)

It’s a difficult time

  • No science is a driver of the computing it relies on
    • We don’t control where investments are made
  • We don’t influence the economics
    • We are tiny by comparison to hyperscalers, clouds, etc.

AI is driving, and like any disruptive technology it’s driving rather recklessly

35 of 35

Outlook (2/2)

To sustainably sustain our mission something will have to give

  • We will need to get more resources
    • It’s not obvious that maintaining a flat budget is possible

  • We will need to take more time with fewer resources
    • Make allocations on shared resources

  • We will need to expand what we can use for computing and where we work