1 of 18

Transforming Generative AI from Unsustainable to Attainable
Sree Ganesan, VP Product, d-matrix.ai


2 of 18

The Exploding World of Generative AI


Microsoft: 1 trillion inferences/day. Meta: 200 trillion inferences/day.

Search Engines, Image Generation, Content Creation, Conversational Agents, Question Answering Systems, Code Generation, Video Generation, 3D Models/Scenes, Digital Twins, Smart Factories, Model Repo.

3 of 18

But… Demand WAY Outstrips Supply


4 of 18

d-Matrix breaking through the barrier with In-Memory Compute

Why?

5 of 18

Cost


$100,000,000,000 in CAPEX alone to deploy ChatGPT or Bard into every Google Search

28,936 Nvidia GPUs + $700,000/day for OpenAI to run ChatGPT

When asked if training GPT-4 cost $100,000,000, Altman replied, “It’s more than that.”

6 of 18

Power


Without sustainable practices, AI will consume more energy than the human workforce by 2025.

The energy needed to power AI could account for up to 3.5% of global electricity consumption by 2030 if current practices remain unchanged.

"The Generative AI Race Has a Dirty Secret": integrating LLMs into search engines could mean a fivefold increase in computing power and huge carbon emissions.

7 of 18

Size


“I think we're at the end of the era where it's going to be these, like, giant, giant models. We'll make them better in other ways.”

- Sam Altman, OpenAI CEO

April 2023

Source: CSET, Georgetown University, 2022

Note: The blue line represents growing costs assuming compute per dollar doubles every four years, with error shading representing no change in compute costs or a doubling time as fast as every two years. The red line represents expected GDP at a growth of 3% per year from 2019 levels with error shading representing growth between 2% and 5%.

Bigger ≠ Better

8 of 18

Unique Challenges of Generative Inference

  • Models are large (billions of parameters) and context lengths are growing (up to 128K tokens)

🡪 Requires more memory capacity

  • If a model is too large to fit on a single device, it must be parallelized across multiple devices

🡪 Requires more compute capacity

  • While prompt processing is compute-bound, token generation is memory-bound: faster memory bandwidth yields faster token generation.

🡪 Requires both high memory bandwidth and high peak compute capability

All of the above contribute acutely to pain points in cost, performance, and power (see the rough sizing sketch below).
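To make these constraints concrete, here is a back-of-envelope sizing sketch in Python. The model shape (a hypothetical 70B-parameter decoder, 80 layers, 8 KV heads, 128-dim heads, 128K context), the 80 GB device, and the bandwidth figures are illustrative assumptions, not d-Matrix numbers; the 150 TB/s value simply echoes the SRAM bandwidth quoted later in this deck.

```python
import math

# Back-of-envelope sizing for generative inference. All model/hardware numbers are
# illustrative assumptions (a hypothetical 70B-parameter decoder with a 128K context),
# not d-Matrix specifications.

GB = 1e9

def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Bytes needed to hold the weights at 16-bit precision."""
    return n_params * bytes_per_param

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache bytes for one sequence: 2 (K and V) * layers * heads * dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

weights = weight_bytes(70e9)                                  # ~140 GB of FP16/BF16 weights
kv = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                    context_len=128_000)                      # one 128K-token sequence
print(f"weights ~{weights / GB:.0f} GB, KV cache per 128K sequence ~{kv / GB:.1f} GB")

# Memory capacity: the weights alone exceed a single 80 GB device -> model parallelism.
device_mem_gb = 80
print("devices needed just for the weights:", math.ceil(weights / (device_mem_gb * GB)))

# Memory bandwidth: each generated token re-reads (roughly) all weights plus the KV cache,
# so batch-1 decode speed is capped by bandwidth / bytes moved per token.
for bw_tb_s in (3.35, 150.0):   # HBM-class vs. on-chip-SRAM-class bandwidth, both illustrative
    tok_per_s = bw_tb_s * 1e12 / (weights + kv)
    print(f"{bw_tb_s:>6.2f} TB/s -> at most ~{tok_per_s:.0f} tokens/s per sequence")
```

Even this crude bound shows why capacity forces multi-device deployments and why decode speed tracks memory bandwidth rather than peak TOPS.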


9 of 18

d-Matrix breaking through the barrier with In-Memory Compute

What d-Matrix is doing about it

10 of 18

A New Computing Paradigm is Needed


[Diagram: In the traditional architecture, multiply-accumulate compute sits apart from memory behind a low-bandwidth interface, creating the A.I. barrier. In the Digital In-Memory Compute architecture, multiply units are embedded directly in the memory arrays and feed accumulation over a high-bandwidth path.]
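A one-line roofline model makes the same point numerically. The peak-compute and bandwidth figures below are purely illustrative; the takeaway is that batch-1 token generation has an arithmetic intensity of roughly 1 FLOP per byte, so attainable throughput is pinned to the memory-bandwidth ceiling unless that ceiling is raised, for example by computing inside the memory arrays.

```python
# Minimal roofline sketch (illustrative numbers, not d-Matrix specifications).
# Attainable throughput = min(peak compute, memory bandwidth * arithmetic intensity).
# Batch-1 decode is a GEMV: ~2 FLOPs per 16-bit weight read, i.e. about 1 FLOP/byte.

def attainable_tflops(peak_tflops: float, bandwidth_tb_s: float, flops_per_byte: float) -> float:
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)

batch1_intensity = 1.0      # ~1 FLOP per byte of weights streamed during decode
peak = 1000.0               # hypothetical peak of 1000 TFLOP/s

for label, bw in [("traditional, off-chip DRAM", 3.35), ("in-memory-compute SRAM", 150.0)]:
    t = attainable_tflops(peak, bw, batch1_intensity)
    print(f"{label:>26}: {bw:6.1f} TB/s -> {t:7.1f} TFLOP/s attainable ({100 * t / peak:.0f}% of peak)")
```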

11 of 18

Three Generations of Proven Silicon

Nighthawk: World's first IMC; compiler + mapper
Jayhawk I: World's first BoW chiplet; 2 TB/s die-to-die bandwidth
Jayhawk II: World's first DIMC + chiplet; 150 TOPS/W, 150 TB/s SRAM BW

12 of 18

Corsair: efficient GenAI inference


Corsair Hardware

Aviator Software

13 of 18

Aviator: enterprise-grade software for easy and fast inference deployment



Easily integrate Aviator with open ecosystem tools or your own deployment stack:

  • Convert the model to enable Corsair numerics & sparsity
  • Distribute the workload across cards and servers
  • Compile and optimize the model to run on Corsair
  • Optimize the inference runtime and model serving on Corsair
  • Orchestrate, manage, and monitor Corsair cards and clusters
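As a purely hypothetical illustration of that flow, a deployment script might chain the stages as below. None of these module, class, or function names are real Aviator APIs; they are invented for this sketch.

```python
# Hypothetical convert -> distribute -> compile -> serve pipeline mirroring the stages above.
# Every name here is invented for illustration; this is NOT the real Aviator API.

from dataclasses import dataclass

@dataclass
class DeploymentPlan:
    model_id: str
    num_cards: int
    numerics: str = "block-float"   # assumed label for Corsair-style numerics
    sparsity: bool = True

def convert(plan: DeploymentPlan) -> None:
    """Convert the model to the target numerics and enable sparsity."""
    print(f"[convert]    {plan.model_id} -> {plan.numerics}, sparsity={plan.sparsity}")

def distribute(plan: DeploymentPlan) -> None:
    """Shard the workload across cards and servers."""
    print(f"[distribute] sharding across {plan.num_cards} cards")

def compile_model(plan: DeploymentPlan) -> None:
    """Compile and optimize the converted, sharded graph for the target hardware."""
    print(f"[compile]    optimizing {plan.model_id}")

def serve(plan: DeploymentPlan, port: int = 8000) -> None:
    """Bring up an inference endpoint; orchestration and monitoring hook in here."""
    print(f"[serve]      {plan.model_id} listening on :{port}")

plan = DeploymentPlan(model_id="example/llm-70b", num_cards=8)
for stage in (convert, distribute, compile_model, serve):
    stage(plan)
```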

14 of 18

Current focus: Datacenter Inference


Cloud

On-Premises

15 of 18

GenAI inference: Datacenter Scale


The solution varies based on the customer's datacenter infrastructure, including rack height and rack power density.

Working with OEMs to build inference servers with d-Matrix PCIe cards.

[Diagram: a 4U or 5U inference server with two CPUs, each attached to a PCIe switch that fans out to d-Matrix PCIe cards.]
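A toy calculation shows how rack height and rack power density interact when sizing such a deployment; every number below (rack units, per-server power, rack power budget) is an assumption for illustration, not a d-Matrix or OEM specification.

```python
# Toy rack-level sizing: servers per rack is the tighter of the height and power limits.
# All numbers are illustrative assumptions.

rack_units_total = 42         # standard rack height in rack units (U)
server_units = 5              # the slide mentions 4U or 5U inference servers
server_power_kw = 4.0         # assumed power per server populated with PCIe cards
rack_power_budget_kw = 17.0   # assumed rack power density

by_height = rack_units_total // server_units
by_power = int(rack_power_budget_kw // server_power_kw)
servers_per_rack = min(by_height, by_power)

print(f"height allows {by_height} servers, power allows {by_power} -> deploy {servers_per_rack} per rack")
```

With these assumed numbers the power budget, not the rack height, is the binding constraint, which is why the slide calls out power density explicitly.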

16 of 18

The d-Matrix Advantage


Circuits & Numerics

  • Digital In-Memory Compute (DIMC)
  • Block Float Sparsity (see the sketch after these lists)
  • Compression

Chiplets & Advanced Packaging

  • 2D, 3D Stacking
  • Logic, Memory Co-package

Software

  • Easy to Use
  • Performant, Scalable 
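For readers unfamiliar with block floating point: the idea is to share one exponent across a block of values and keep only short integer mantissas, preserving dynamic range at a fraction of the bits. The block size and mantissa width below are illustrative choices, not d-Matrix's actual Corsair numerics.

```python
import numpy as np

# Minimal block floating-point (BFP) sketch: one shared exponent per block of values,
# short integer mantissas. Block size and mantissa width are illustrative only.

def bfp_quantize(x: np.ndarray, block: int = 16, mant_bits: int = 8):
    """Quantize a 1-D array (length divisible by `block`) into per-block mantissas + exponents."""
    x = x.reshape(-1, block)
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    # Shared exponent: scale each block so its largest magnitude fits ~mant_bits signed bits.
    exp = np.ceil(np.log2(np.maximum(max_abs, 1e-38))) - (mant_bits - 1)
    mant = np.round(x / 2.0 ** exp).astype(np.int32)
    return mant, exp

def bfp_dequantize(mant: np.ndarray, exp: np.ndarray) -> np.ndarray:
    return (mant * 2.0 ** exp).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
mant, exp = bfp_quantize(w)
w_hat = bfp_dequantize(mant, exp)
print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
```

Sharing the exponent is what makes the format hardware-friendly: within a block, multiply-accumulate reduces to integer arithmetic, with the exponents applied once per block rather than per value.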

Making Generative AI commercially viable

Significant benefits over GPUs:

🡪  Better Throughput

🡪  Better Latency

🡪  Better TCO

17 of 18

d-Matrix breaking through the barrier with In-Memory Compute

Build With Us, Partner With Us

18 of 18

INTELLIGENCE DELIVERED™

 

www.d-matrix.ai

2024 © d-Matrix