CSE 5 · PRINCIPLES OF AI · DISCUSSION 4
What Is an Image Model?
pure noise → final image
Diffusion: from noise to image, in ~50 steps
Image Source: https://images.unsplash.com/photo-1610171426203-cc9f214e947c?q=80&w=1400&auto=format&fit=crop
OVERVIEW
Session Overview
0–5 min · Hook + setup
5–17 min · Round 1: Same prompt, four image models
17–27 min · Round 2: Break the model
27–35 min · Mental model: How diffusion actually works
35–45 min · Round 3: Image to video
45–50 min · Final round: Spot the AI + closing
CSE 5 — Discussion 4 · What Is an Image Model?
ROUND 1 · 12 MIN
Same prompt, four models
PROMPT "A medieval scholar reading an ancient book by candlelight, photorealistic"
YOUR MISSION
1. Pair up. Each pair takes one tool below.
2. Run the prompt above — exactly as written.
3. Screenshot the result. We'll share in 5 min.
5 min generate · 5 min share + compare
WATCH FOR
Same prompt → four personalities.
✓ Different aesthetic defaults
✓ Different strengths (text? hands? composition?)
△ A product = model + system prompt + tuning
Some sites that give you free credits:
Nano Banana aistudio.google.com
Bing Image Creator bing.com/create
Ideogram ideogram.ai
Leonardo.AI leonardo.ai
ROUND 2 · 10 MIN
Break the model — find what AI can't do
🤲 HANDS
Two people firmly shaking hands, all 10 fingers visible, side angle
📖 LONG TEXT
An open book showing a full page from a novel — 200+ readable words
🍎 COUNTING
Exactly 13 red apples and 7 green apples on a wooden table
🐈 SPATIAL
Dog LEFT of cat, red ball BETWEEN them, yellow bird ABOVE the cat
🍷 QUANTITY
Five wineglasses in a row: empty, ¼, ½, ¾, completely full
⏰ ANALOG TIME
An analog clock face showing exactly 11:23
YOUR MISSION
Pick 2–3 challenges. Use any tool. See whether the latest image generators (e.g., GPT-Image-2, Nano Banana 2) still make mistakes.
MENTAL MODEL
Last week — and where we go today
Chat model
predict the next token
one word at a time
Diffusion model
? ? ?
MENTAL MODEL · STEP 1
We deliberately destroy training images
Forward process →
We add the noise ourselves, step by step. The model isn't involved yet.
Each adjacent pair becomes one training example: (slightly noisier) → (slightly cleaner)
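A minimal sketch of the forward process, assuming a toy 4-pixel "image" stored as a list of floats. The names `add_noise`, `forward_process`, and `training_pairs`, and the noise amount 0.3, are illustrative choices and not taken from any real library.

```python
import random

def add_noise(image, amount, rng):
    """Return a copy of `image` with Gaussian noise (std `amount`) added to every pixel."""
    return [p + rng.gauss(0.0, amount) for p in image]

def forward_process(image, steps, amount, rng):
    """Noise the image `steps` times. chain[0] is clean, chain[-1] is the noisiest."""
    chain = [image]
    for _ in range(steps):
        chain.append(add_noise(chain[-1], amount, rng))
    return chain

def training_pairs(chain):
    """Each adjacent pair becomes one example: (slightly noisier, slightly cleaner)."""
    return [(chain[i + 1], chain[i]) for i in range(len(chain) - 1)]

rng = random.Random(0)
image = [0.0, 0.5, 1.0, 0.5]   # toy 4-pixel "image" (assumption for illustration)
chain = forward_process(image, steps=5, amount=0.3, rng=rng)
pairs = training_pairs(chain)
print(len(chain), len(pairs))  # 6 versions of the image, 5 training pairs
```

No model appears anywhere in this code: the pairs are produced purely by adding noise.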
MENTAL MODEL · STEP 2
Train one model to undo a single step
Chat model: predict the next token
Diffusion model: predict a slightly cleaner version
(repeat this training a few billion times)
And then train it to reconstruct the original image →
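Once a model can undo one step, generating an image is a matter of applying it over and over, starting from pure noise. A toy sketch: `denoise_step` is a hypothetical stand-in for the trained network (here it only nudges pixels toward a fixed pattern), and the 50 steps echo the "~50 steps" figure from the title slide.

```python
import random

# Stand-in for "what the model learned"; a real model has no single fixed target.
TARGET = [0.0, 0.5, 1.0, 0.5]

def denoise_step(image):
    """Hypothetical one-step denoiser: move each pixel 10% of the way toward TARGET."""
    return [p + 0.1 * (t - p) for p, t in zip(image, TARGET)]

def generate(steps, rng):
    """Start from pure noise and apply the one-step denoiser `steps` times."""
    image = [rng.gauss(0.0, 1.0) for _ in TARGET]
    for _ in range(steps):
        image = denoise_step(image)
    return image

def max_error(image):
    """Largest per-pixel distance from TARGET."""
    return max(abs(p - t) for p, t in zip(image, TARGET))

rng = random.Random(0)
noise_error = max_error(generate(steps=0, rng=random.Random(0)))
final = generate(steps=50, rng=rng)
print(max_error(final) < noise_error)
```

Each call removes only a little noise; the image emerges from the repetition, not from any single step.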
MENTAL MODEL · LOSS
How does the model actually learn?
It checks its own output against the originals it was trained on.
Model's reconstruction
close, but not perfect
Original training image
we already have this
↔
compare
LOSS = pixel-by-pixel difference → backprop, update
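A minimal sketch of "compare, then update", assuming a one-parameter toy model `pred[i] = w * noisy[i]` and a toy corruption that doubles every pixel. Real models have billions of parameters and use automatic backprop; the hand-written gradient here is only for illustration.

```python
def mse(pred, target):
    """Pixel-by-pixel difference: mean of the squared per-pixel errors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def train_step(w, noisy, original, lr):
    """One gradient-descent update for the toy model pred[i] = w * noisy[i]."""
    n = len(noisy)
    pred = [w * x for x in noisy]
    # d(MSE)/dw = (2/n) * sum((pred - original) * noisy)
    grad = 2.0 / n * sum((p - t) * x for p, t, x in zip(pred, original, noisy))
    return w - lr * grad

original = [0.0, 0.5, 1.0, 0.5]       # the training image we already have
noisy = [2.0 * p for p in original]   # toy corruption (assumption): every pixel doubled
w = 1.0                               # start by returning the input unchanged

before = mse([w * x for x in noisy], original)
for _ in range(100):
    w = train_step(w, noisy, original, lr=0.1)
after = mse([w * x for x in noisy], original)
print(before > after)
```

In this toy setup `w` ends near 0.5, the value that undoes the doubling, and the loss shrinks toward zero.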
MENTAL MODEL · CONDITIONING
How prompts (and images) steer the noise
The model isn't just denoising — it's denoising TOWARD a condition.
TEXT → IMAGE
Your prompt: "white cat"
↓ (prompt augmentation)
"A fluffy white Persian cat sitting on a wooden stool, golden hour lighting, photorealistic..."
WHY this exists
Training images had LONG, detailed captions.
Model learned: long captions → good images.
So short prompts tend to perform worse.
→ Many modern tools rewrite and enrich your prompt automatically.
IMAGE → VIDEO
Your reference image:
↓ encode
Latent code: [0.42, −0.18, 0.93, 0.71, ...]
(a numerical fingerprint of your image)
WHY this works
Training paired each video with the latent of one of its frames.
Model learned a binding: latent ↔ matching video.
→ Your image becomes a fingerprint; the model generates a video that matches it.
Conditioning = the model has learned to bind input (text or image) to output.
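A toy sketch of denoising TOWARD a condition. The `BINDINGS` table is a hypothetical stand-in for what a real model learns from captioned training data; the prompts and pixel values are made up for illustration.

```python
import random

# Hypothetical learned bindings from a condition to an image pattern.
BINDINGS = {
    "white cat": [1.0, 1.0, 0.9, 1.0],
    "black cat": [0.0, 0.1, 0.0, 0.0],
}

def denoise_step(image, condition):
    """Toy conditioned denoiser: nudge pixels 10% toward the pattern bound to `condition`."""
    target = BINDINGS[condition]
    return [p + 0.1 * (t - p) for p, t in zip(image, target)]

def generate(condition, steps, seed):
    """Start from seeded noise and denoise toward the given condition."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in range(4)]
    for _ in range(steps):
        image = denoise_step(image, condition)
    return image

# Same starting noise (same seed), different conditions, different images.
white = generate("white cat", steps=50, seed=0)
black = generate("black cat", steps=50, seed=0)
print([round(p, 1) for p in white])
print([round(p, 1) for p in black])
```

The only thing that differs between the two runs is the condition passed to every denoising step.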
MENTAL MODEL · LIMITS
Why it failed before
🤲 BAD HANDS
Hands are okay-ish now, but multi-person interaction still breaks.
Each step only checks 'looks locally plausible.'
📖 LONG-FORM TEXT
Short signs are solved. But 200 words of a novel?
Each character carries only a tiny pixel signal, so there are too many places to fail.
🍎 WRONG COUNT
'13 apples' ≈ '14 apples' in noise space.
It's doing texture statistics, not arithmetic.
🐈 SPATIAL CONFUSION
'Left of', 'between', 'above' are concepts, not pixel patterns.
The model sees patches, not relationships.
These failures are consequences of the architecture, though the latest models seem to have fixed most of them!
MENTAL MODEL · TO VIDEO
Video = images that stay consistent across time
Same trick. One more dimension.
8 consecutive frames from a video generated in one shot
t1 · t2 · t3 · t4 · t5 · t6 · t7 · t8
Same scholar. Same room. Same lighting. Generated as one block — not eight independent images.
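A toy sketch of "one more dimension": the noise gets a time axis, and each denoising step sees neighbouring frames. The smoothing rule in `denoise_step` is a hypothetical stand-in for a real video model, chosen only to show frames becoming consistent with each other.

```python
import random

T, H, W = 8, 2, 2   # 8 frames of a toy 2x2 video (sizes are assumptions)

def noise_block(rng):
    """Pure noise with a time axis: block[t][y][x]."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(W)] for _ in range(H)]
            for _ in range(T)]

def denoise_step(block):
    """Toy joint step: pull each pixel toward its neighbours in time."""
    out = []
    for t in range(T):
        prev_f = block[max(t - 1, 0)]
        next_f = block[min(t + 1, T - 1)]
        out.append([[0.5 * block[t][y][x] + 0.25 * (prev_f[y][x] + next_f[y][x])
                     for x in range(W)] for y in range(H)])
    return out

def flicker(block):
    """Largest pixel change between consecutive frames."""
    return max(abs(block[t + 1][y][x] - block[t][y][x])
               for t in range(T - 1) for y in range(H) for x in range(W))

rng = random.Random(0)
block = noise_block(rng)
before = flicker(block)
for _ in range(50):
    block = denoise_step(block)   # all 8 frames are updated together
after = flicker(block)
print(before > after)
```

Eight independent images would have no reason to agree with each other; here every step touches the whole block, so frame-to-frame flicker shrinks.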
Image Source: Veo 3.1 – Google AI Studio
ROUND 3 · 10 MIN
Image → video
YOUR MISSION
1. Pick your favorite image from earlier rounds.
2. Upload to runwayml.com or hailuoai.video
3. Add a motion prompt:
"the woman slowly turns her head"
"the candle flickers, the scholar nods"
"wind blows through the trees"
4. Wait 2–3 min. Discuss while you wait →
WHILE YOU WAIT
Three questions for the wait time:
• Why are these all 5–10 seconds?
→ cost, temporal consistency
• Whose motion looks most natural?
→ different training data, different priors
• Could this make a 3-min film?
→ today these are shot-level tools, not film-level
Runway runwayml.com
Hailuo hailuoai.video
FINAL ROUND · 5 MIN
Spot the AI — which 5 are generated?
Image Sources: https://images.unsplash.com/photo-1610171426203-cc9f214e947c?q=80&w=1400&auto=format&fit=crop
https://www.pexels.com/photo/fruits-on-plate-on-dark-table-10117256/
https://www.staples.com/infinity-instruments-profuse-itc-wall-clock-20-dia-14246sv-830/product_24566778?cid=ps:gs:dot:nb:pla:furn&gad_source=1&gad_campaignid=14022539736&gbraid=0AAAAACN4I7y4NkkLAQ-eSnTP78fASlaFs&gclid=EAIaIQobChMI6Y6u7ouPlAMVnylECB2ORhUqEAQYCiABEgKbH_D_BwE
TAKEAWAY
Three questions to take with you
HEALTHCARE
If a medical image could be AI-generated, what does that mean for telehealth and remote consultations?
BUSINESS
Does your company need to label AI-generated marketing material? In which contexts is disclosure required?
HUMANITIES
When a 'photograph' may have no real subject — does photography still exist as an art form?
AI image generation isn't replacing photography. It's splitting "image" into two:
records of reality, and imagined images.