Energy consumption during training

  1. Device (D) = NVIDIA A100 GPU
  2. Training time (T) = 100 days = 100 * 24 hrs = 2,400 hrs
  3. Number of GPUs (N) = 25,000
  4. The GPUs were installed in NVIDIA HGX servers.
     1. GPUs per server (C) = 8
     2. Number of servers (S) = N/C = 25,000/8 = 3,125
  5. Thermal design power (TDP) is the power a piece of hardware draws under maximum theoretical load.
     1. TDP of an NVIDIA DGX A100 server (used here as a close proxy for the HGX servers above) = 6.5 kW
     2. This means that if the server runs at full power for 1 hr, it consumes 6.5 kWh of energy.
  6. Energy consumed by one server = T * TDP = 2,400 hrs * 6.5 kW = 15,600 kWh
  7. Total energy consumed by S servers = S * T * TDP = 3,125 * 15,600 kWh = 48,750,000 kWh = 48.75 million kWh
  8. It's customary to multiply the hardware's energy consumption by the power usage effectiveness (PUE) of the data center in which the hardware runs.
     1. PUE describes how efficiently a data center uses energy; a PUE of 1.18 means the facility draws 18% more energy than the IT hardware alone (for cooling, power distribution, etc.).
     2. Assumption: GPT-4 was trained in an Azure data center (because OpenAI partners with Microsoft).
     3. Average PUE of an Azure data center = 1.18
  9. Hence, total energy consumed = total energy consumed by S servers * PUE = 48,750,000 kWh * 1.18 = 57,525,000 kWh = 57.525 million kWh (the full chain is reproduced in the sketch below).
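The estimate is just a chain of multiplications, so it is easy to sanity-check in code. The following minimal Python sketch reproduces it end to end; every input is one of the assumptions stated in the list above (100 days of training, 25,000 A100s, a 6.5 kW server TDP, Azure's average PUE of 1.18), not a measured value.

```python
# Back-of-envelope estimate of GPT-4 training energy consumption.
# All inputs are the assumptions from the list above, not measured values.

GPUS = 25_000               # N: number of NVIDIA A100 GPUs (assumption)
GPUS_PER_SERVER = 8         # C: GPUs per HGX/DGX server
TRAINING_HOURS = 100 * 24   # T: 100 days of training (assumption)
SERVER_TDP_KW = 6.5         # TDP of a DGX A100 server, used as a proxy
AZURE_PUE = 1.18            # average PUE of an Azure data center (assumption)

servers = GPUS / GPUS_PER_SERVER                        # S = N/C = 3,125
energy_per_server_kwh = TRAINING_HOURS * SERVER_TDP_KW  # T * TDP = 15,600 kWh
hardware_energy_kwh = servers * energy_per_server_kwh   # S * T * TDP = 48,750,000 kWh
total_energy_kwh = hardware_energy_kwh * AZURE_PUE      # * PUE = 57,525,000 kWh

print(f"Servers:               {servers:,.0f}")
print(f"Energy per server:     {energy_per_server_kwh:,.0f} kWh")
print(f"Hardware energy:       {hardware_energy_kwh:,.0f} kWh")
print(f"Total energy (w/ PUE): {total_energy_kwh:,.0f} kWh "
      f"({total_energy_kwh / 1e6:.3f} million kWh)")
```

Running it prints 57,525,000 kWh (57.525 million kWh), matching the figure in step 9; changing any single assumption (e.g. the training duration or the PUE) scales the result linearly, which makes it easy to explore alternative scenarios.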