1 of 15

CSE 5 · PRINCIPLES OF AI · DISCUSSION 4

What Is an Image Model?

pure noise

final image

Diffusion: from noise to image, in ~50 steps

Image Source: https://images.unsplash.com/photo-1610171426203-cc9f214e947c?q=80&w=1400&auto=format&fit=crop

2 of 15

OVERVIEW

Session Overview

0–5 min · Hook + setup
5–17 min · Round 1: Same prompt, four image models
17–27 min · Round 2: Break the model
27–35 min · Mental model: How diffusion actually works
35–45 min · Round 3: Image to video
45–50 min · Final round: Spot the AI + closing

CSE 5 — Discussion 4 · What Is an Image Model?


3 of 15

ROUND 1 · 12 MIN

Same prompt, four models


PROMPT "A medieval scholar reading an ancient book by candlelight, photorealistic"

YOUR MISSION

1. Pair up. Each pair takes one tool below.

2. Run the prompt above — exactly as written.

3. Screenshot the result. We'll share in 5 min.

5 min generate · 5 min share + compare

WATCH FOR

Same prompt → four personalities.

✓ Different aesthetic defaults

✓ Different strengths (text? hands? composition?)

△ A product = model + system prompt + tuning

Some sites that give you free credits:

Nano Banana · aistudio.google.com
Bing Image Creator · bing.com/create
Ideogram · ideogram.ai
LEONARDO.AI · leonardo.ai

4 of 15

ROUND 2 · 10 MIN

Break the model — find what AI can't do


🤲 HANDS

Two people firmly shaking hands, all 10 fingers visible, side angle

📖 LONG TEXT

An open book showing a full page from a novel — 200+ readable words

🍎 COUNTING

Exactly 13 red apples and 7 green apples on a wooden table

🐈 SPATIAL

Dog LEFT of cat, red ball BETWEEN them, yellow bird ABOVE the cat

🍷 QUANTITY

Five wineglasses in a row: empty, ¼, ½, ¾, completely full

ANALOG TIME

An analog clock face showing exactly 11:23

YOUR MISSION Pick 2–3 challenges. Use any tool. See whether the latest image generators (e.g., GPT-Image-2, Nano Banana 2) still make mistakes.

5 of 15

MENTAL MODEL

Last week — and where we go today


Chat model

predict the next token

one word at a time

Diffusion model

? ? ?

6 of 15

MENTAL MODEL · STEP 1

We deliberately destroy training images


Forward process →

We do this manually. The model isn't involved yet.

Each adjacent pair becomes one training example: (slightly noisier) → (slightly cleaner)
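The forward process and the pairing of adjacent images can be sketched in a few lines. This is a minimal illustration, not how any particular product does it: the function names, the 8×8 blank "image", the five steps, and the fixed noise scale are all assumptions chosen for readability.

```python
import numpy as np

def forward_process(image, num_steps=5, noise_scale=0.3, seed=0):
    """Return [clean, slightly noisy, ..., very noisy] versions of one image."""
    rng = np.random.default_rng(seed)
    chain = [image.astype(np.float64)]
    for _ in range(num_steps):
        # Each step adds a little more random noise to the previous version.
        noisier = chain[-1] + noise_scale * rng.standard_normal(image.shape)
        chain.append(noisier)
    return chain

def training_pairs(chain):
    """Pair each noisier image (model input) with its cleaner neighbour (target)."""
    return [(chain[i + 1], chain[i]) for i in range(len(chain) - 1)]

image = np.zeros((8, 8))          # hypothetical stand-in for a training image
chain = forward_process(image)
pairs = training_pairs(chain)
print(len(chain), len(pairs))     # 6 5
```

One image noised over five steps yields five (noisier → cleaner) training examples, and no model is involved at any point.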

7 of 15

MENTAL MODEL · STEP 2

Train one model to undo a single step


Chat model: predict the next token

Diffusion model: predict a slightly cleaner version

(repeat this training a few billion times)

And then train it to reconstruct the original image →

8 of 15

MENTAL MODEL · LOSS

How does the model actually learn?


It checks its own output against the originals it was trained on.

Model's reconstruction

close, but not perfect

Original training image

we already have this

compare

LOSS = pixel-by-pixel difference → backprop, update
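The compare-then-update loop can be made concrete with a deliberately tiny example. The sketch below assumes a toy "model" with a single weight `w` that just rescales the noisy image, and uses mean squared error as the pixel-by-pixel difference; real diffusion models are large neural networks and often use variations on this loss, so treat it purely as an illustration of the idea.

```python
import numpy as np

def pixel_loss(reconstruction, original):
    """Mean squared pixel-by-pixel difference between two images."""
    return float(np.mean((reconstruction - original) ** 2))

rng = np.random.default_rng(0)
original = rng.random((8, 8))                          # hypothetical training image
noisy = original + 0.5 * rng.standard_normal((8, 8))   # its noised version

w = 0.0               # the toy model's only weight (real models have billions)
learning_rate = 0.1
losses = []
for _ in range(100):
    reconstruction = w * noisy                         # model's attempt at the original
    losses.append(pixel_loss(reconstruction, original))
    # Gradient of the mean squared error with respect to w (the "backprop" step).
    grad = float(np.mean(2.0 * (reconstruction - original) * noisy))
    w -= learning_rate * grad                          # the "update" step

print(losses[0], losses[-1])   # the loss shrinks as w is adjusted
```

Because the original image is already known, the loss needs no human labels: the training data grades the model's output.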

9 of 15

MENTAL MODEL · CONDITIONING

How prompts (and images) steer the noise


The model isn't just denoising — it's denoising TOWARD a condition.

TEXT → IMAGE

Your prompt: "white cat"

↓ (prompt augmentation)

"A fluffy white Persian cat sitting on a wooden stool, golden hour lighting, photorealistic..."

WHY this exists

Training images had LONG, detailed captions. The model learned: long captions → good images. Short prompts therefore tend to perform worse.

→ Many modern tools rewrite and enrich your prompt automatically.

IMAGE → VIDEO

Your reference image:

↓ encode

Latent code: [0.42, −0.18, 0.93, 0.71, ...]

(a numerical fingerprint of your image)

WHY this works

Training paired every video with the latent of one of its frames. The model learned a binding: latent ↔ matching video.

→ Your image becomes a fingerprint, and the model generates a video that matches it.

Conditioning = the model has learned to bind input (text or image) to output.
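Denoising toward a condition can be sketched as a loop that starts from pure noise and applies one small step about 50 times. Everything here is a stand-in: the `TARGETS` table plays the role of what a trained model has learned, `denoise_step` replaces the neural network, and the 4×4 "images" are just grids of brightness values.

```python
import numpy as np

# Hypothetical "learned" knowledge: in a real model this lives in the weights,
# not in a lookup table.
TARGETS = {
    "white cat": np.full((4, 4), 1.0),   # bright image
    "black cat": np.full((4, 4), 0.0),   # dark image
}

def denoise_step(x, condition, strength=0.1):
    """Toy stand-in for the trained model: nudge x slightly toward the condition."""
    return x + strength * (TARGETS[condition] - x)

def generate(condition, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((4, 4))      # start from pure noise
    for _ in range(num_steps):
        x = denoise_step(x, condition)   # each step is only slightly cleaner
    return x

white = generate("white cat")
black = generate("black cat")
print(white.mean(), black.mean())
```

Both calls start from the same noise (same seed), yet they end up at different images because the condition steers every step.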

10 of 15

MENTAL MODEL · LIMITS

Why it failed before


🤲 BAD HANDS

Hands are okay-ish now, but multi-person interaction still breaks. Each step only checks 'looks locally plausible.'

📖 LONG-FORM TEXT

Short signs are solved. But 200 words of a novel? Each character carries only a tiny pixel signal, so there are too many places to fail.

🍎 WRONG COUNT

'13 apples' ≈ '14 apples' in noise space. It's doing texture statistics, not arithmetic.

🐈 SPATIAL CONFUSION

'Left of', 'between', 'above' are concepts, not pixel patterns. The model sees patches, not relationships.

These failures are consequences of the architecture, though the latest models seem to have solved most of them!

11 of 15

MENTAL MODEL · TO VIDEO

Video = images that stay consistent across time


Same trick. One more dimension.

8 consecutive frames from a video generated in one shot

t1 · t2 · t3 · t4 · t5 · t6 · t7 · t8

Same scholar. Same room. Same lighting. Generated as one block — not eight independent images.

Image Source: Veo 3.1 – Google AI Studio
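"One more dimension" can be read literally in terms of array shapes. The sketch below assumes a small 64×64 RGB frame and eight frames, both chosen only for illustration; real video models typically work on compressed latents rather than raw pixels.

```python
import numpy as np

height, width, channels = 64, 64, 3   # hypothetical small frame size
num_frames = 8                        # matches the eight frames t1..t8

rng = np.random.default_rng(0)

# An image model starts from one block of noise shaped like a single picture.
image_noise = rng.standard_normal((height, width, channels))

# A video model starts from one block of noise with an extra time axis,
# so all frames are denoised together rather than one by one.
video_noise = rng.standard_normal((num_frames, height, width, channels))

print(image_noise.shape)                      # (64, 64, 3)
print(video_noise.shape)                      # (8, 64, 64, 3)
print(video_noise.size // image_noise.size)   # 8
```

The extra axis also hints at cost: eight frames means eight times as many values to denoise at every step.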

12 of 15

ROUND 3 · 10 MIN

Image → video


YOUR MISSION

1. Pick your favorite image from earlier rounds.

2. Upload to runwayml.com or hailuoai.video

3. Add a motion prompt:

"the woman slowly turns her head"

"the candle flickers, the scholar nods"

"wind blows through the trees"

4. Wait 2–3 min. Discuss while you wait →

WHILE YOU WAIT

Three questions for the wait time:

• Why are these all 5–10 seconds?

→ cost, temporal consistency

• Whose motion looks most natural?

→ different training data, different priors

• Could this make a 3-min film?

→ today these are shot-level tools, not film-level

Runway runwayml.com

Hailuo hailuoai.video

13 of 15

FINAL ROUND · 5 MIN

Spot the AI — which 5 are generated?


14 of 15

FINAL ROUND · 5 MIN

Spot the AI — which 5 are generated?


Image Sources:
https://images.unsplash.com/photo-1610171426203-cc9f214e947c?q=80&w=1400&auto=format&fit=crop
https://www.pexels.com/photo/fruits-on-plate-on-dark-table-10117256/
https://www.staples.com/infinity-instruments-profuse-itc-wall-clock-20-dia-14246sv-830/product_24566778?cid=ps:gs:dot:nb:pla:furn&gad_source=1&gad_campaignid=14022539736&gbraid=0AAAAACN4I7y4NkkLAQ-eSnTP78fASlaFs&gclid=EAIaIQobChMI6Y6u7ouPlAMVnylECB2ORhUqEAQYCiABEgKbH_D_BwE

15 of 15

TAKEAWAY

Three questions to take with you


HEALTHCARE

If a medical image could be AI-generated, what does that mean for telehealth and remote consultations?

BUSINESS

Does your company need to label AI-generated marketing material? In which contexts is disclosure required?

HUMANITIES

When a 'photograph' may have no real subject — does photography still exist as an art form?

AI image generation isn't replacing photography. It's splitting "image" into two: records of reality and imagined images.