| | A | B | I | K | L | M | N | O | P | Q |
|---|---|---|---|---|---|---|---|---|---|---|
1 | No. of nodes | No. of H100s | Total switches | Facility & Ops Lifetime Cost | H100 Node Price | Switch Price | InfiniBand Network Connectors & Cables | Total Cost (H100, InfiniBand, Facility & Ops) | Avg. Node Raw Cost | Avg. H100 Raw Cost |
2 | 2 | 16 | 0 | $100,000 | $620,000 | $0 | $12,960 | $732,960 | $366,480 | $45,810 |
3 | 64 | 512 | 8 | $3,200,000 | $19,840,000 | $288,000 | $675,840 | $24,003,840 | $375,060 | $46,883 |
4 | 128 | 1,024 | 48 | $6,400,000 | $39,680,000 | $1,728,000 | $2,621,440 | $50,429,440 | $393,980 | $49,248 |
5 | 256 | 2,048 | 96 | $12,800,000 | $79,360,000 | $3,456,000 | $5,242,880 | $100,858,880 | $393,980 | $49,248 |
6 | 512 | 4,096 | 192 | $25,600,000 | $153,600,000 | $6,912,000 | $10,485,760 | $196,597,760 | $383,980 | $47,998 |
7 | 1,024 | 8,192 | 384 | $51,200,000 | $307,200,000 | $13,824,000 | $20,971,520 | $393,195,520 | $383,980 | $47,998 |
8 | 2,048 | 16,384 | 768 | $102,400,000 | $634,880,000 | $27,648,000 | $41,943,040 | $806,871,040 | $393,980 | $49,248 |
9 | 4,096 | 32,768 | 2,560 | $204,800,000 | $1,269,760,000 | $92,160,000 | $124,518,400 | $1,691,238,400 | $412,900 | $51,613 |
10 | 8,192 | 65,536 | 5,120 | $409,600,000 | $2,539,520,000 | $184,320,000 | $249,036,800 | $3,382,476,800 | $412,900 | $51,613 |
11 | 13,107 | 104,858 | 8,192 | $655,360,000 | $4,063,232,000 | $294,912,000 | $398,458,880 | $5,411,962,880 | $412,900 | $51,613 |
12 | 16,384 | 131,072 | 10,240 | $819,200,000 | $5,079,040,000 | $368,640,000 | $498,073,600 | $6,764,953,600 | $412,900 | $51,613 |
13 | 32,768 | 262,144 | 20,480 | $1,638,400,000 | $10,158,080,000 | $737,280,000 | $996,147,200 | $13,529,907,200 | $412,900 | $51,613 |
14 | 65,536 | 524,288 | 40,960 | $3,276,800,000 | $20,316,160,000 | $1,474,560,000 | $1,992,294,400 | $27,059,814,400 | $412,900 | $51,613 |
15 | ||||||||||
16 | All prices and projections are as of Aug 2024. | |||||||||
17 | The H100 node price is assumed to be USD $310k. This includes the InfiniBand cards for the full 3.2 Tbps interconnect (8 x 400 Gbps). For reference, an H100 node quotation from Thinkmate at USD $317k is archived here: https://web.archive.org/web/20240822233446/https://www.thinkmate.com/quotation-request?a=YToxOntzOjI6ImlkIjtpOjc0OTY2Mzt9 | |||||||||
18 | The 64-port 400 Gbps InfiniBand switch is assumed to cost $36k. This is based on the FS.com variant, which is cheaper than the official Nvidia variant (approx. $50k): https://web.archive.org/web/20240822234224/https://www.fs.com/products/206473.html?now_cid=4887 | |||||||||
19 | For networking, we assume the following adapters and cables, with cable length standardised to 30 m for simplicity (the larger installs will need much longer cables): (1) 400G transceiver ($700): https://www.fs.com/products/200963.html (2) 800G transceiver ($800): https://www.fs.com/products/205113.html (3) 30 m MTP cable ($220): https://www.fs.com/products/224209.html?attribute=98411&id=3616836 | |||||||||
20 | We use the fat-tree network topology (the setup suggested by Nvidia's guidelines), optimized for the lowest switch count. Up to 64 nodes: 8 x 1-level fat-tree network planes. Up to 2,048 nodes: 8 x 2-level fat-tree network planes. Up to 65,536 nodes: 8 x 3-level fat-tree network planes. | |||||||||
21 | 10 GbE networking and switches for internet connectivity, plus 1 GbE networking and switches for IPMI, cost less than $1k per node. They are zeroed out here and assumed to be part of the facility cost. | |||||||||
22 | We use an average of $50k per H100 node (the Facility & Ops Lifetime Cost column) for labour, rack space, internet, facility cooling, parts replacement, other hardware setup, etc. This is considered conservative, given the high cost of the electrical generation and cabling required (which is outside my expertise) and the incredibly high failure rate of the H100 SXM cards (<99.9%). To illustrate the sheer scale of the energy being consumed: each H100 node draws, on average, about as much power as 4 US households, and the energy consumption of some of these clusters is comparable to that of small to medium-sized US cities. They also require dedicated cabling and cooling, neither of which is cheap. Nvidia uses a projection of facility and operation cost being 50% of total cost here: https://s201.q4cdn.com/141608511/files/doc_presentations/2023/Oct/01/ndr_presentation_oct_2023_final.pdf | |||||||||
23 | ||||||||||
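
The table's arithmetic can be sketched in a few lines of Python. This is a rough reconstruction, not the original spreadsheet formulas. The prices and fat-tree thresholds come from the notes. The function names, the switch-count rule (non-top levels split their 64 ports evenly between downlinks and uplinks) and the per-node cabling figures (column N divided by the node count) are assumptions inferred from the table. The 512 and 1,024 node rows imply a $300k node price, and the 13,107 node row uses a fractional node count, so this sketch will not reproduce those rows exactly.

```python
import math

# Figures taken from the notes (Aug 2024 prices).
NODE_PRICE = 310_000         # USD per 8x H100 node, InfiniBand cards included
SWITCH_PRICE = 36_000        # USD per 64-port 400 Gbps switch
FACILITY_PER_NODE = 50_000   # USD per node, matches the Facility & Ops column
PLANES = 8                   # one network plane per 400 Gbps link on a node
PORTS = 64                   # ports per switch
GPUS_PER_NODE = 8

# ASSUMPTION: per-node connector and cable cost by fat-tree depth,
# back-derived by dividing column N by the node count. Not stated in the notes.
CABLING_PER_NODE = {0: 6_480, 1: 10_560, 2: 20_480, 3: 30_400}


def fat_tree_levels(nodes: int) -> int:
    """Fat-tree depth per plane, using the thresholds from the notes."""
    if nodes <= 2:
        return 0  # ASSUMPTION: two nodes are cabled directly, no switch
    if nodes <= 64:
        return 1
    if nodes <= 2_048:
        return 2
    if nodes <= 65_536:
        return 3
    raise ValueError("more than 65,536 nodes needs a deeper topology")


def total_switches(nodes: int) -> int:
    """Switch count across all 8 planes (hypothetical helper)."""
    levels = fat_tree_levels(nodes)
    if levels == 0:
        return 0
    # Non-top levels use half their ports as downlinks; the top level uses all.
    lower = (levels - 1) * math.ceil(nodes / (PORTS // 2))
    top = math.ceil(nodes / PORTS)
    return PLANES * (lower + top)


def cluster_cost(nodes: int) -> dict:
    """Total and average cost for a cluster of `nodes` H100 nodes."""
    levels = fat_tree_levels(nodes)
    per_node = NODE_PRICE + FACILITY_PER_NODE + CABLING_PER_NODE[levels]
    total = nodes * per_node + total_switches(nodes) * SWITCH_PRICE
    return {
        "switches": total_switches(nodes),
        "total": total,
        "avg_per_node": total / nodes,
        "avg_per_h100": total / (nodes * GPUS_PER_NODE),
    }


if __name__ == "__main__":
    for n in (2, 64, 128, 4_096):
        print(n, cluster_cost(n))
```

Under these assumptions, `cluster_cost(128)` gives 48 switches and a total of $50,429,440, matching the 128-node row.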