Energy consumption during training

  1. Device (D) = NVIDIA A100 GPU
  2. Training time (T) = 100 days = 100 * 24 hrs = 2,400 hrs
  3. Number of GPUs (N) = 25,000
  4. The GPUs were installed in NVIDIA HGX servers.
     1. GPUs per server (C) = 8
     2. Number of servers (S) = N/C = 25,000/8 = 3,125
  5. Thermal design power (TDP) is the power a piece of hardware draws under maximum theoretical load.
     1. TDP of an NVIDIA DGX A100 server (used here as a close proxy for the HGX servers above) = 6.5 kW
     2. This means that if the server runs at full power for 1 hr, it consumes 6.5 kWh of energy.
  6. Energy consumed by one server = T * TDP = 2,400 hrs * 6.5 kW = 15,600 kWh
  7. Total energy consumed by S servers = S * T * TDP = 3,125 * 15,600 kWh = 48,750,000 kWh = 48.75 million kWh
  8. It's customary to multiply the hardware's energy consumption by the power usage effectiveness (PUE) of the data center in which the hardware runs.
     1. PUE describes how efficiently a data center uses energy; a PUE of 1.18 means the facility draws 18% more energy than the IT hardware alone (for cooling, power distribution, etc.).
     2. Assumption: GPT-4 was trained in an Azure data center (because OpenAI partners with Microsoft).
     3. Average PUE of an Azure data center = 1.18
  9. Hence, total energy consumed = total energy consumed by S servers * PUE = 48,750,000 kWh * 1.18 = 57,525,000 kWh = 57.525 million kWh (the full chain is reproduced in the sketch below).
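The estimate is just a chain of multiplications, so it is easy to sanity-check in code. The following minimal Python sketch reproduces it end to end; every input is one of the assumptions stated in the list above (100 days of training, 25,000 A100s, a 6.5 kW server TDP, Azure's average PUE of 1.18), not a measured value.

```python
# Back-of-envelope estimate of GPT-4 training energy consumption.
# All inputs are the assumptions from the list above, not measured values.

GPUS = 25_000               # N: number of NVIDIA A100 GPUs (assumption)
GPUS_PER_SERVER = 8         # C: GPUs per HGX/DGX server
TRAINING_HOURS = 100 * 24   # T: 100 days of training (assumption)
SERVER_TDP_KW = 6.5         # TDP of a DGX A100 server, used as a proxy
AZURE_PUE = 1.18            # average PUE of an Azure data center (assumption)

servers = GPUS / GPUS_PER_SERVER                        # S = N/C = 3,125
energy_per_server_kwh = TRAINING_HOURS * SERVER_TDP_KW  # T * TDP = 15,600 kWh
hardware_energy_kwh = servers * energy_per_server_kwh   # S * T * TDP = 48,750,000 kWh
total_energy_kwh = hardware_energy_kwh * AZURE_PUE      # * PUE = 57,525,000 kWh

print(f"Servers:               {servers:,.0f}")
print(f"Energy per server:     {energy_per_server_kwh:,.0f} kWh")
print(f"Hardware energy:       {hardware_energy_kwh:,.0f} kWh")
print(f"Total energy (w/ PUE): {total_energy_kwh:,.0f} kWh "
      f"({total_energy_kwh / 1e6:.3f} million kWh)")
```

Running it prints 57,525,000 kWh (57.525 million kWh), matching the figure in step 9; changing any single assumption (e.g. the training duration or the PUE) scales the result linearly, which makes it easy to explore alternative scenarios.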