1 of 68

Unlocking Insights:

Leveraging LLMs to Extract Structured Data from Unstructured Sources

John Horton

MIT Sloan & NBER

2 of 68

This session sounds boring.

3 of 68

Or, if not boring, per se,

DBA/CTO business, not executive business.

4 of 68

Why it is not boring and is executive business

  • Luck aside, the success of an organization largely reflects the quality of the decisions that organization makes
  • Short of a Steve Jobs-like genius-savant-CEO/founder, we have to make decisions with data
  • Often the data we have at hand is wildly inadequate for the decision we must make
  • How do we fix this?
    • Collect more data
    • Extract more insight from the data we already have

5 of 68

Key Idea for today:

Recent advances in AI, properly harnessed, can help our data-poor decision-making.

6 of 68

Let's jump in.

7 of 68

What is "unstructured" data?

  • Unstructured data is data that is not in a form that is as useful as it could be for our purposes. Examples:
    • Audio
    • Long/free form text
    • Videos
    • Images
    • PDFs
    • Mixed documents (e.g., research papers with tables, figures and text)
  • "Structured" data
    • We often mean tabular data, typically with some well-defined structure
      • Columns with meaningful headers
      • Rows are observations
    • Or something we could easily represent in a database

8 of 68

Unstructured biographical data as raw text

9 of 68

Jeopardy-like exercise: What questions does this data answer?

  • Where is our user from?
  • What is their age (partial)?
  • What is their marital status?
  • Do they have children?
  • Do they seem happy/excited or angry?

10 of 68

Jeopardy-like exercise: What questions does this data answer?

  • Where is our user from?
  • What is their age (partial)?
  • What is their marital status?
  • Do they have children?
  • Do they seem happy/excited or angry?

We (as humans) could easily answer these questions on a case-by-case basis. But a spreadsheet of 5,000 bio texts would be infeasible to work with.

11 of 68

Let's structure this unstructured

biographical data using AI (GPT-4o)
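
For concreteness, here is a minimal sketch of what that extraction call could look like with the openai Python package. The bio text, prompt wording, and field names are illustrative assumptions, not the actual prompt from the slides.

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    bio = "Hi! I'm John. I live in Cambridge, MA with my wife and our two kids."  # invented

    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # ask for valid JSON back
        messages=[
            {"role": "system", "content": (
                "Extract structured data from the user's bio as JSON with "
                "keys: name, residence, family (spouse, kids)."
            )},
            {"role": "user", "content": bio},
        ],
    )

    data = json.loads(response.choices[0].message.content)
    print(data)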

12 of 68

13 of 68

This other text is from an open source Python package that makes it easier to work with LLMs.

14 of 68

Note this placeholder-like syntax: this is where we can "pass in" the context from earlier.

15 of 68

16 of 68

Note the "key:value" pair. The 'key' here is "name"; it has an associated value, namely, well, my name.

17 of 68

It captured a 'nested' relationship within the 'family' structure: more keys and values, logically grouped.

18 of 68

The structure can also handle more than one value for a given key (in this case, multiple children).
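
Putting the last few slides together, the extracted record could look something like this (all names and values are invented for illustration):

    data = {
        "name": "John",                  # a key:value pair
        "residence": "Cambridge, MA",
        "family": {                      # a nested structure
            "spouse": "Jane",            # hypothetical
            "kids": ["Alice", "Bob"],    # multiple values for one key
        },
    }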

19 of 68

With our structured data, we could use "normal" approaches to do analysis (sketched below)

  • Where is our user from? → data['residence']
  • What is their age (partial)? → 'Adult' if data['family']['kids'] is not None
  • What is their marital status? → data['family']['spouse'] is not None
  • Do they have children? → len(data['family']['kids'])>0
  • Do they seem happy/excited or angry?
    • More on this one later
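
A minimal sketch of that analysis in Python, assuming each bio has been extracted into a dict like the one above:

    records = [data]  # imagine 5,000 of these, one per bio text

    for r in records:
        residence = r["residence"]                      # where they are from
        is_married = r["family"]["spouse"] is not None  # marital status
        has_kids = len(r["family"]["kids"]) > 0         # children?
        print(residence, is_married, has_kids)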

20 of 68

What's possible here,

extraction-wise,

is changing rapidly.

21 of 68

This is maybe 20 "tokens" (more on the definition of 'token' later)

22 of 68

New models can handle enormous context windows

23 of 68

The long context: What 1 million tokens gets you

24 of 68

Production-versus-analysis cost comparison

  • Using the full 1M tokens with an advanced model might cost $3
    • This cost will surely fall, but still might be so high as to completely obliterate the unit economics if done as part of "production"
  • BUT…
    • For analytics work, this is trivial: your data scientists cost 100x this every hour (see the arithmetic below)
    • You can sample!
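
Back-of-the-envelope, using the (illustrative) $3-per-million-token price from above:

    price_per_token = 3 / 1_000_000  # illustrative $ per input token

    docs = 5_000            # hypothetical corpus size
    tokens_per_doc = 500    # hypothetical average document length
    sample = 500            # or analyze a 10% sample

    full_cost = docs * tokens_per_doc * price_per_token
    sample_cost = sample * tokens_per_doc * price_per_token
    print(full_cost, sample_cost)  # $7.50 vs. $0.75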

25 of 68

"Classification":

Saying what something is

26 of 68

"Do they seem/happy/excited or angry?"

27 of 68

"Do they seem/happy/excited or angry?"

Note that unlike the other questions, the answer is not "here" explicitly; we must make an inference.

28 of 68

29 of 68

30 of 68

31 of 68

32 of 68

This joke has become (badly) dated. The "is it a bird" task is now trivially easy.

33 of 68

Let's send the

cartoon to GPT-4o
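
A sketch of what sending an image looks like with the openai package; the image URL is a placeholder, not the actual cartoon.

    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Is this a photo of a bird?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cartoon.png"}},  # placeholder
            ],
        }],
    )
    print(response.choices[0].message.content)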

34 of 68

35 of 68

36 of 68

Let's do a simpler classification task.

37 of 68

Imagine we have unstructured data. And we have examples of "labels" for the data.

38 of 68

39 of 68

The "old" (say 3 years ago) way

would be to train a classifier with examples.

40 of 68

Training an ML classifier

  • We'd need a lot more examples (than 6)
  • Even then, what is the input?
    • "Massachusetts" is not a number.
    • We can split into 'tokens': https://gptforwork.com/tools/tokenizer (see the sketch below)
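
You can do the same split programmatically; a sketch with the tiktoken package (exact token boundaries depend on the model's encoding):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")  # assumes a recent tiktoken release
    tokens = enc.encode("Massachusetts")
    print(tokens)                             # a short list of integer token ids
    print([enc.decode([t]) for t in tokens])  # the text piece each id maps to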

41 of 68

42 of 68

Training an ML classifier

  • We'd need a lot more examples than, say, 6
  • Even then, what is the input?
    • "Massachusetts" is not a number.
    • We can split into 'tokens': https://gptforwork.com/tools/tokenizer
  • What if:
    • the classifier had seen tokens for "New" (from "New Zealand" → Boop)
    • and "Mexico" (from "Mexico" → Boop)
    • How would it do with "New Mexico"?
      • It really wants to say "Boop"
  • Would it ever generalize to unseen examples?
    • E.g., if I had every state but Wyoming and every country, how would even the best model do on "Wyoming"?

43 of 68

Humans, show us what you're made of:

44 of 68

Why do we do well?

45 of 68

Now: In-context learning with a frontier model (GPT-4o)

46 of 68

Now: In-context learning

The very limited training data gets put directly in the context window

47 of 68

Now: In-context learning

I give it the valid output labels to keep the model generation structured
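
A sketch of what such a prompt could look like. The slides show countries mapping to "Boop"; the "Beep" label for states is an assumption, and the wording is invented.

    from openai import OpenAI

    examples = [
        ("Massachusetts", "Beep"),
        ("New Zealand", "Boop"),
        ("Mexico", "Boop"),
    ]
    lines = ["Label each place with exactly one of: Beep, Boop.", ""]
    lines += [f"{place} --> {label}" for place, label in examples]
    lines += ["", "New Mexico -->"]
    prompt = "\n".join(lines)

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)  # ideally "Beep"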

48 of 68

49 of 68

50 of 68

How does the model do this?!?

  • This is a case where unpacking the "GPT" acronym is helpful
    • GPT = "Generative Pre-Trained Transformer"
    • "Pre-trained" on a *lot* of text
  • As such, the pre-trained model already "knows" a great deal.
    • It knows what US States are
    • It knows what countries are
    • It knows what arrows ("-->") mean
    • It "understands" instructions
  • It can use all that information to do zero/one-shot learning
    • Why, exactly, it can do this is not totally clear

51 of 68

What does this mean, practically?

  • Using a language model on a one-shot classification task might work extremely well right "out of the box"
  • You can also "fine-tune" a model with your own examples
    • Start with a base model and give it your examples (data format sketched below)
  • If you use the (relatively expensive) GPT approach for classification, you eventually create a very good quasi-synthetic dataset to use
    • It's like you have many examples of labeled data you can use to train a cheaper model
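
For the fine-tuning route, a sketch of the chat-format JSONL that OpenAI's fine-tuning endpoint expects; the labels here would come from your earlier GPT runs (example content invented):

    import json

    examples = [
        {"messages": [
            {"role": "system", "content": "Label the place Beep (state) or Boop (country)."},
            {"role": "user", "content": "Massachusetts"},
            {"role": "assistant", "content": "Beep"},
        ]},
        # ...one entry per labeled example
    ]

    with open("train.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")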

52 of 68

Managerial implications

  • Given we have this very powerful classification/structuring tool
    • What data are we collecting now that we aren't making use of?
    • Given what we can now do, what data are we not collecting that we could/should collect?

53 of 68

A very simple example:

Transactional email

54 of 68

"Please, don't talk to us - we can't handle it!"

55 of 68

56 of 68

57 of 68

Not only is this not much money now,

very likely this is the most

expensive it will ever be.

58 of 68

Extended email example

EDSL is an open source Python package for simulating surveys, experiments and market research with AI agents and large language models.

It simplifies common tasks of LLM-based research:

  • Prompting LLMs to answer questions
  • Specifying the format of responses
  • Using AI agent personas to simulate responses for target audiences
  • Comparing & analyzing responses for multiple LLMs at once

59 of 68

Example: Analyzing unstructured data

Problem: Customer service emails need to be categorized and triaged.

EDSL workflow:

  1. Construct questions
  2. Add the emails to the questions
  3. Send the questions to a language model
  4. Analyze the responses in a dataset

60 of 68

Constructing a question

  • Select a question type: multiple choice, free text, linear scale, etc.
  • Format the question:

QuestionList prompts the model to format its response as a list.

We can parameterize questions to reuse them.
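
A sketch of this step with EDSL; the question name and wording are illustrative:

    from edsl import QuestionList

    # {{ email }} is a placeholder we will fill in with each email's text
    q = QuestionList(
        question_name="issues",
        question_text="List the issues raised in this email: {{ email }}",
    )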

61 of 68

Adding data / context to a question

  • Import data: PDF, CSV, doc, image, table, list, dictionary, etc.
  • Create “Scenario” dictionaries representing the data

The scenario keys match the {{ placeholder }} used in the question.
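
A sketch, with the scenario key matching the {{ email }} placeholder above (the email is invented):

    from edsl import Scenario

    # the key "email" matches the {{ email }} placeholder in the question
    scenario = Scenario({"email": "My invoice is wrong and I was double-charged."})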

62 of 68

Running a question & inspecting results

We can use the response as answer options for a new question.
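
A sketch of running the question and reusing the answers; method names follow EDSL's documented pattern, but treat the details as assumptions:

    from edsl import QuestionMultipleChoice

    results = q.by(scenario).run()
    results.select("issues").print()

    # reuse the listed issues as answer options for a follow-up question
    issues = results.select("issues").to_list()[0]
    q2 = QuestionMultipleChoice(
        question_name="top_issue",
        question_text="Which issue is most urgent in this email: {{ email }}?",
        question_options=issues,
    )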

63 of 68

Adding data / context to a question

Now we create a separate scenario for each email.
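
A sketch of scaling this up, one scenario per email (the emails are invented):

    from edsl import Scenario, ScenarioList

    emails = [
        "My invoice is wrong and I was double-charged.",
        "How do I reset my password?",
    ]
    scenarios = ScenarioList([Scenario({"email": e}) for e in emails])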

64 of 68

Running a question & inspecting results

Results include fields for the questions, scenarios, agents, models, answers, raw responses, tokens generated, and costs.

65 of 68

Applying survey logic, rules & memory

  • We can add logic and rules to a survey of questions: skip/stop, etc.

We can combine questions in a survey and add logic/rules for how they are answered.
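
A sketch of combining the two questions into a survey with a skip rule; the expression syntax follows EDSL's docs, but treat it as an assumption:

    from edsl import Survey

    survey = Survey(questions=[q, q2])

    # skip the follow-up when the first question found no issues (illustrative rule)
    survey = survey.add_skip_rule(q2, "issues == []")

    results = survey.by(scenarios).run()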

66 of 68

Running a survey & inspecting results

The follow-on question is (or is not) administered based on the response to the first question.

67 of 68

Concluding thoughts

  • We already collect a great deal of data that could be made more useful
  • The costs of doing this structuring have fallen dramatically
  • Not only can we get more out of what we had before, we can also collect more data
    • And we can potentially collect it in much more "natural" ways

68 of 68

Thanks!