Energy consumption during training
- Device (D) = NVIDIA A100 GPU
- Time (T) = 100 days = 100*24 hrs = 2400 hrs
- Number of GPUs (N) = 25000
- GPUs were installed in NVIDIA HGX servers
- GPUs per server (C) = 8
- Number of servers (S)= N/C = 25000/8 = 3125
- Thermal Design Power (TDP) is the power a piece of hardware draws under maximum theoretical load.
- TDP of an NVIDIA DGX A100 server = 6.5 kW (used here as a proxy for the HGX server configuration)
- This means that if the DGX server runs at full power for 1 hr, it will consume 6.5 kWh.
- Energy consumed by one server over training = T * TDP = 2400 hrs * 6.5 kW = 15,600 kWh
- Total energy consumed by S servers = S * T * TDP = 3125 * 15,600 kWh = 48,750,000 kWh = 48.75 million kWh
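The arithmetic above can be sketched as a short script; the figures are the assumptions stated in these notes (25,000 GPUs, 100 days, 8 GPUs per server, 6.5 kW per server):

```python
# Back-of-envelope estimate of hardware energy consumed during training,
# before accounting for data-center overhead (PUE).
T_HOURS = 100 * 24       # training time: 100 days expressed in hours
N_GPUS = 25_000          # number of A100 GPUs (assumption from the notes)
GPUS_PER_SERVER = 8      # GPUs per server
TDP_KW = 6.5             # server TDP in kW

servers = N_GPUS // GPUS_PER_SERVER           # 3,125 servers
energy_per_server_kwh = T_HOURS * TDP_KW      # 15,600 kWh per server
total_energy_kwh = servers * energy_per_server_kwh

print(f"{total_energy_kwh:,.0f} kWh")  # 48,750,000 kWh
```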
- It’s customary to multiply the hardware’s energy consumption by the power usage effectiveness (PUE) of the data center in which the hardware runs
- PUE describes how efficiently a data center uses energy: it is the ratio of total facility energy to the energy delivered to the IT equipment, so a PUE of 1.0 would be ideal
- Assumption: GPT-4 was trained in an Azure data center (because OpenAI partners with Microsoft)
- Average PUE of an Azure data center = 1.18
- Hence, total energy consumed = (total energy consumed by S servers) * (PUE of the data center holding the servers) = 48,750,000 kWh * 1.18 = 57,525,000 kWh = 57.525 million kWh
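The final adjustment is a single multiplication of the hardware energy figure by the assumed Azure PUE:

```python
# Apply the data-center overhead (PUE) to the hardware energy estimate.
HARDWARE_ENERGY_KWH = 48_750_000  # total energy consumed by the servers (from above)
PUE = 1.18                        # assumed average PUE of an Azure data center

facility_energy_kwh = HARDWARE_ENERGY_KWH * PUE

print(f"{facility_energy_kwh:,.0f} kWh")  # ~57,525,000 kWh = 57.525 million kWh
```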