1 of 25

AI Starter Pack

NICAR25

sk-proj-tu224CJfP3uTCbaaxMxXT3BlbkFJMLdFplC2c9GP1NjAFNtT

https://github.com/asg017/nicar25-ai-starter-js

Alex Garcia

Bluesky: @alexgarcia.xyz

https://alexgarcia.xyz

JavaScript Edition!

2 of 25

hi

  • I’m Alex!
  • Independent software engineer
    • Was: Software engineer at Facebook after college
    • Contract work: Data visualizations, frontends, data pipelines, GIS, etc.
    • Interests: SQLite + extensions, local AI, D3, Observable notebooks, etc.
  • Currently:
    • Working w/ Simon Willison on Datasette
    • Mozilla-sponsored project: sqlite-vec
  • Past NICARs:
    • “Introduction to Observable notebooks”
    • “Introduction to Datasette”
    • Lightning talks
  • Our coach: Nael Shiab!

3 of 25

Agenda

  • Part 1: Embeddings
    • “Semantic search” with vectors!
    • Embeddings with Ollama, storing w/ sqlite-vec
    • Demo: NBC News headlines search engine
    • Bonus: Visualize embeddings with UMAP
  • Part 2: Structured Output Generation!
    • Force an LLM to output JSON with your own schema!
    • OpenAI + Vercel’s AI SDK
    • Demo: Political email parsing

4 of 25

Survey Time!

5 of 25

SETUP

~/Desktop/hands_on_classes/20250307-friday-ai-starter-pack-javascript

6 of 25

Part 1: Embeddings

7 of 25

Embeddings!

  • Main idea: Represent text as a list of numbers, so we can compare text with math
  • An “embedding” is a vector that represents a piece of text
    • A vector is a list of numbers!
  • Requires an “embeddings model”, i.e. an AI/ML model
    • Many free & open source options, easy to run!
    • Small enough to run on your laptop
  • Today’s demo: scraped NBC News article headlines
    • From June 2024 - Present, ~10k headlines
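To make "compare text with math" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three short vectors are made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```javascript
// Cosine similarity: near 1 means the vectors point the same way,
// near 0 means they are unrelated.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up 3-dimensional "embeddings", for illustration only.
const measles = [0.9, 0.1, 0.0];
const vaccines = [0.8, 0.3, 0.1];
const football = [0.0, 0.2, 0.9];

console.log(cosineSimilarity(measles, vaccines)); // close to 1
console.log(cosineSimilarity(measles, football)); // close to 0
```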

8 of 25

NBC News Headline Sample

| id | year | month | headline | url |
| --- | --- | --- | --- | --- |
| 10320 | 2025 | 3 | Elon Musk takes familiar fraud, waste claims to Joe Rogan with DOGE discussion | |
| 10319 | 2025 | 3 | What to know about Tristan and Andrew Tate's controversies as they return to the U.S. | |
| 10318 | 2025 | 3 | Zelenskyy's meeting with Trump and Vance unravels into an extraordinary clash | |
| 10317 | 2025 | 3 | Special counsel who prosecuted Hunter Biden quietly resigned in January | |
| 10316 | 2025 | 3 | FBI returns records from Mar-a-Lago search to Trump, White House says | |
| 10315 | 2025 | 3 | Zelenskyy remains defiant after blowup with Trump and Vance | |
| 10314 | 2025 | 3 | An employer's unreasonable request, a 'racist' glitch and a mystery reawakens: The news quiz | |
| 10313 | 2025 | 3 | Trump still really wants to win a Nobel Peace Prize | |
| 10312 | 2025 | 3 | Trump-Zelenskyy clash marks a defining turn away from U.S. defense of democracies | |
| 10311 | 2025 | 3 | As Texas measles outbreak grows, here's what to know about the disease, vaccines and response | |

9 of 25

10 of 25

11 of 25

12 of 25

13 of 25

Commanders acquire wide receiver Deebo Samuel in trade with 49ers, AP source says

A privately built spacecraft has successfully landed on the moon

Pope Francis, stable in hospital, thanks well-wishers for support

Singer in Tehran arrested during live performance

'Captain America: Brave New World' stays at the top on weak Oscars weekend at the box office

Where would these headlines go?

14 of 25

2 dimensions aren’t enough!

  • What about 3?

  • 4 dimensions??

15 of 25

jk let's do 384 dimensions

  • or 768! or 1024! or 1536!
  • “Vector dimensions” are determined by the embedding model you choose
    • In general: More dimensions, better quality
    • More dimensions, more expensive to generate + store
    • It’s a balance!
  • Today’s model: sentence-transformers/all-MiniLM-L6-v2
    • 384 dimensions
    • Trained circa Aug 2021
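A rough way to see the storage side of that balance. This sketch assumes each dimension is stored as a 4-byte float32, which is a common default but an assumption here, and it ignores index and metadata overhead.

```javascript
// Rough storage estimate: rows x dimensions x bytes per number.
// bytesPerValue = 4 assumes float32 storage (an assumption, not a given).
function embeddingStorageBytes(rowCount, dimensions, bytesPerValue = 4) {
  return rowCount * dimensions * bytesPerValue;
}

const headlineCount = 10_000; // the "~10k headlines" in today's dataset
for (const dims of [384, 768, 1024, 1536]) {
  const megabytes = embeddingStorageBytes(headlineCount, dims) / 1_000_000;
  console.log(`${dims} dimensions: ~${megabytes.toFixed(1)} MB`);
}
```

For the 384-dimension model that works out to 10,000 × 384 × 4 = 15,360,000 bytes, about 15 MB. Going to 1536 dimensions quadruples it.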

16 of 25

Generating embeddings w/ Ollama

  • Ollama - straightforward way to run LLMs and embedding models on your laptop!
    • Other options: transformers.js, Python, llama.cpp, etc.
    • But Ollama is easy!
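A minimal sketch of asking a local Ollama server for embeddings over HTTP. The assumptions: Ollama is running on its default port (11434), the model was pulled under the name "all-minilm", and your Node version has a global fetch (18+). The helper names are illustrative and not taken from the workshop repo.

```javascript
// Assumed defaults: local Ollama server, default port, /api/embed endpoint.
const OLLAMA_URL = "http://localhost:11434/api/embed";

// Build the HTTP request for a batch of texts.
function buildEmbedRequest(texts, model = "all-minilm") {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, input: texts }),
  };
}

// fetchImpl is injectable so the function can be exercised without a server.
async function embed(texts, { model = "all-minilm", fetchImpl = fetch } = {}) {
  const response = await fetchImpl(OLLAMA_URL, buildEmbedRequest(texts, model));
  if (!response.ok) {
    throw new Error(`Ollama request failed with status ${response.status}`);
  }
  const data = await response.json();
  return data.embeddings; // one vector per input text
}

// Usage, with Ollama running:
// embed(["As Texas measles outbreak grows"]).then((vectors) => {
//   console.log(vectors[0].length); // number of dimensions for the chosen model
// });
```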

17 of 25

Demo demo demo

  • part 1: semantic search with NBC News headlines
  • part 2, time permitting: NICAR session/speakers search!
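What semantic search boils down to, sketched as brute force in plain JavaScript: embed the query, measure the distance to every stored vector, keep the closest few. This is a stand-in for what a vector store such as sqlite-vec does inside the database, not the demo's actual code, and the 2-dimensional vectors are invented for illustration.

```javascript
// Euclidean (L2) distance: smaller means more similar.
function l2Distance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Brute-force nearest neighbors: score every row, sort, keep the closest k.
function search(queryVector, rows, k = 3) {
  return rows
    .map((row) => ({
      headline: row.headline,
      distance: l2Distance(queryVector, row.embedding),
    }))
    .sort((a, b) => a.distance - b.distance)
    .slice(0, k);
}

// Toy vectors, invented for illustration. Real ones come from the model.
const rows = [
  { headline: "Zelenskyy remains defiant after blowup with Trump and Vance", embedding: [0.9, 0.1] },
  { headline: "Trump still really wants to win a Nobel Peace Prize", embedding: [0.7, 0.3] },
  { headline: "As Texas measles outbreak grows, here's what to know about the disease, vaccines and response", embedding: [0.1, 0.9] },
];

const queryVector = [0.2, 0.8]; // pretend this is the embedding of "vaccine news"
console.log(search(queryVector, rows, 2));
```

Brute force is fine for ~10k headlines. Larger collections are where a dedicated index starts to matter.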

18 of 25

Bonus: Visualize embeddings

  • “Dimensionality reduction” (i.e. 384 dimensions -> 2 dimensions) allows you to visualize and compare embeddings!
  • Demo: https://observablehq.com/d/04bc1c1b0de9db7c
  • There are other, better tools for this! See the tipsheet for more
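To show the shape of the idea without any libraries, here is a sketch that swaps in PCA (a simpler, linear technique) for UMAP. It finds the two directions of greatest variance by power iteration and projects every vector onto them. It will not reproduce UMAP's clusters; it only illustrates "many dimensions in, two coordinates out". All function names are illustrative.

```javascript
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Subtract the per-dimension mean so the data is centered at the origin.
function center(vectors) {
  const dims = vectors[0].length;
  const mean = new Array(dims).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dims; i++) mean[i] += v[i] / vectors.length;
  }
  return vectors.map((v) => v.map((x, i) => x - mean[i]));
}

// Power iteration: repeatedly apply X^T X to a direction until it settles
// on the direction of greatest variance.
function topComponent(data, iterations = 200) {
  const dims = data[0].length;
  // Deterministic, arbitrary starting direction, normalized to length 1.
  let direction = Array.from({ length: dims }, (_, i) => 1 / (i + 1));
  const startNorm = Math.sqrt(dot(direction, direction));
  direction = direction.map((x) => x / startNorm);
  for (let step = 0; step < iterations; step++) {
    const next = new Array(dims).fill(0);
    for (const row of data) {
      const projection = dot(row, direction);
      for (let i = 0; i < dims; i++) next[i] += projection * row[i];
    }
    const norm = Math.sqrt(dot(next, next));
    if (norm === 0) break; // no variance left to find
    direction = next.map((x) => x / norm);
  }
  return direction;
}

// Project every vector onto the top two principal components.
function projectTo2D(vectors) {
  const centered = center(vectors);
  const first = topComponent(centered);
  // Remove the first component from the data, then find the next direction.
  const deflated = centered.map((row) => {
    const p = dot(row, first);
    return row.map((x, i) => x - p * first[i]);
  });
  const second = topComponent(deflated);
  return centered.map((row) => [dot(row, first), dot(row, second)]);
}
```

Each output pair can be fed straight into a scatterplot as x and y.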

19 of 25

Final words on embeddings

  • Can embed other data besides text!
    • Images, audio, faces (🤨)
    • Requires different models!
  • The “Training date” of an embedding model might matter a lot!
    • ex: Who “was” Elon Musk in August 2021?
    • Does the model know that “Dobbs Decision” (June 2022) is related to “reproductive rights”?
  • There are many, many embedding models (my favorites in tipsheet)

20 of 25

Part 2: Structured Output Generation

21 of 25

Chatbots are too chatty!

  • Sometimes you want data, not a conversation.
  • “Structured Output Generation”
    • Give a strict schema to an LLM to force it to return data in a format you expect
  • JSON Schema is the current standard
    • In JavaScript: done with Zod
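A dependency-free sketch of the idea: describe the shape you want as a JSON Schema, then check whatever the model returns against it. The field names are hypothetical, not the workshop's schema, and the tiny checker only handles the few keywords used here. In practice Zod or a full JSON Schema validator does this job. The sample values come from the political email shown later in the deck.

```javascript
// A hypothetical JSON Schema for one political email (illustrative fields).
const emailSchema = {
  type: "object",
  properties: {
    committee: { type: "string" },
    candidate: { type: "string" },
    asksForDonation: { type: "boolean" },
    suggestedAmounts: { type: "array", items: { type: "number" } },
  },
  required: ["committee", "candidate", "asksForDonation", "suggestedAmounts"],
  additionalProperties: false,
};

// A deliberately tiny checker covering only the keywords used above.
function matchesType(value, spec) {
  if (spec.type === "array") {
    return Array.isArray(value) && value.every((item) => matchesType(item, spec.items));
  }
  return typeof value === spec.type;
}

// Returns a list of problems; an empty list means the shape is right.
function validate(candidate, schema) {
  if (typeof candidate !== "object" || candidate === null || Array.isArray(candidate)) {
    return ["expected an object"];
  }
  const errors = [];
  for (const key of schema.required) {
    if (!(key in candidate)) errors.push(`missing required field: ${key}`);
  }
  for (const [key, value] of Object.entries(candidate)) {
    const spec = schema.properties[key];
    if (!spec) errors.push(`unexpected field: ${key}`);
    else if (!matchesType(value, spec)) errors.push(`wrong type for field: ${key}`);
  }
  return errors;
}

// Pretend this string came back from an LLM.
const llmOutput =
  '{"committee":"Trisha 4 Colorado","candidate":"Trisha Calvarese","asksForDonation":true,"suggestedAmounts":[10,25,35,50,100]}';
console.log(validate(JSON.parse(llmOutput), emailSchema)); // prints []
```

Passing the schema check only means the shape is right. The values inside can still be wrong.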

22 of 25

Demo: Political Campaign Emails

  • s/o Derek Willis
  • Big idea: There’s a lot of “unstructured data” in political emails
    • Can we use LLMs to get that data out?
  • https://political-emails.herokuapp.com/emails/emails

23 of 25

Political Email Sample

The Force is strong with Trisha. | _Unsubscribe_

---

### Hey Peter, it’s Mark Hamill.

**In a moment, I’m going to ask you to make a donation to Trisha Calvarese, the official Democratic nominee running against Lauren Boebert in Colorado.**

But before I do, let me explain why right-wing extremists like Boebert scare me.

**Lauren Boebert isn’t just a national embarrassment – she has an actual vote in Congress.**

Boebert is ready to pass a national abortion ban, gut Medicare and Social Security, and throw our veterans under the bus if she’s reelected again.

_And don’t even get me started on Project 2025…_

---

### The Force is strong with Trisha.

That’s why I am so excited about Trisha’s campaign:

**New polling shows that Trisha can win and defeat Lauren Boebert! But Trisha can only flip this seat blue if she is able to raise enough money and fully fund her campaign.**

I’m ready to help Trisha Calvarese take on Lauren Boebert and finally kick her out of Congress, and I am asking you to join me today:

**Please, will you make a donation right now to Trisha Calvarese’s campaign to help her defeat Lauren Boebert this November?**

Winning this seat and turning it blue not only means getting rid of Boebert, but it will also get us one step closer to taking back the House for Democrats.

---

### _Please use the links in this email to start a monthly donation through ActBlue:_

| **DONATE $10** | **DONATE $25** |
| --- | --- |
| **DONATE $35** | **DONATE $50** |
| **DONATE $100** | **Other Amount** |

---

This election is about integrity and decency. Trisha and I are counting on people like you to chip in now and ensure the right side wins in November.

**Thank you,**

**Mark Hamill**

Actor and Activist

---

**Paid for by Trisha 4 Colorado**

Trisha 4 Colorado

PO Box 630633

Highlands Ranch, CO 80130

Contributions or gifts to Trisha 4 Colorado are not tax-deductible.

This email was sent to dpwillis67@gmail.com. If you received this email in error or if you don't want to receive any emails from us anymore, please [click here to unsubscribe](#).

---

**Sent via ActionNetwork.org.**

To update your email address, change your name or address, or to stop receiving emails from Trish 4 Colorado, [please click here](#).

![](https://click.actionnetwork.org/ss/o/u001.ZbNyqOfLYPaP-d23SgKjnQ/4b3/FxpymR5mQtaLWMB4wO7YBA/ho.gif)

24 of 25

Final word on structured output generation

  • LLMs “lie” and hallucinate, and so does structured output generation!
    • In fact, it may lie more, since you’re forcing an answer out
  • Some models are better at SOG than others
    • Most models are trained to chat, not generate data
    • “Instruct” models are typically the best
  • Run local LLMs to reduce cost!
    • Some models in tipsheet
  • Evals evals evals
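One minimal form an eval can take: hand-label a handful of emails, run the extraction, and count how many fields match. The sketch below assumes exact-match scoring and uses two made-up cases, where "actual" stands in for model output. Real evals usually need more cases and fuzzier matching.

```javascript
// Exact-match comparison that also works for arrays and nested values.
function sameValue(a, b) {
  return JSON.stringify(a) === JSON.stringify(b);
}

// Compare one extracted record against its hand-labeled answer, field by field.
function scoreRecord(expected, actual) {
  const fields = Object.keys(expected);
  const correct = fields.filter((field) => sameValue(expected[field], actual[field]));
  return { correct: correct.length, total: fields.length };
}

// Field-level accuracy across the whole labeled set.
function evaluate(cases) {
  let correct = 0;
  let total = 0;
  for (const { expected, actual } of cases) {
    const score = scoreRecord(expected, actual);
    correct += score.correct;
    total += score.total;
  }
  return total === 0 ? 0 : correct / total;
}

// Two made-up cases for illustration.
const cases = [
  {
    expected: { candidate: "Trisha Calvarese", asksForDonation: true },
    actual: { candidate: "Trisha Calvarese", asksForDonation: true },
  },
  {
    expected: { candidate: "Trisha Calvarese", asksForDonation: true },
    actual: { candidate: "Mark Hamill", asksForDonation: true }, // a plausible extraction error
  },
];

console.log(evaluate(cases)); // 0.75
```

Re-running the same labeled set after changing the prompt, schema, or model shows whether the change actually helped.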

25 of 25

Thank you!