Unlocking Insights:
Leveraging LLMs to Extract Structured Data from Unstructured Sources
John Horton
MIT Sloan & NBER
This session sounds boring.
Or, if not boring per se,
then DBA/CTO business---not executive business.
Why it is not boring and is executive business
Key Idea for today:
Recent advances in AI, properly harnessed, can help our data-poor decision-making.
Let's jump in.
What is "unstructured" data?
Unstructured biographical data as raw text
Jeopardy-like exercise: What questions does this data answer?
We (as humans) could easily answer these questions on a case-by-case basis. But a spreadsheet of 5,000 bio texts would be infeasible to work through by hand.
Let's structure this unstructured
biographical data using AI (GPT-4o)
This other text is from an open-source Python package that makes it easier to work with LLMs
Note this placeholder-like syntax: this is where we can "pass in" the earlier context
Note the "key:value" pair. The 'key' is the name; it has an associated value, namely, well, my name.
It captured a 'nested' relationship within the 'family' structure: more keys and values, logically grouped.
It also has a structure that can handle more than one value for a given key (in this case, multiple children).
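To make the step concrete, here is a minimal sketch of this kind of extraction using the OpenAI Python client. The bio text, the key names, and the example output are all illustrative, not the actual example from the slides:

```python
# A sketch of structured extraction from a bio, assuming the OpenAI Python
# client; the bio text and schema keys below are made up for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

bio_text = (
    "Jane Doe teaches at MIT Sloan. She and her spouse have two "
    "children, Ann and Bo."
)

# The {bio} slot plays the role of the placeholder syntax noted above.
prompt_template = (
    "Extract a JSON object with keys 'name', 'employer', and 'family' "
    "(with nested keys 'spouse' and 'children', a list) from this bio:\n\n{bio}"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt_template.format(bio=bio_text)}],
    response_format={"type": "json_object"},  # request valid JSON back
)
print(response.choices[0].message.content)
# e.g. {"name": "Jane Doe", "employer": "MIT Sloan",
#       "family": {"spouse": true, "children": ["Ann", "Bo"]}}
```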
With our structured data, we could use "normal" approaches to do analysis
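For instance, once each bio is a row of fields, a familiar tool like pandas applies directly. This sketch assumes the extracted records were gathered into a list of dicts; the values are made up:

```python
# "Normal" analysis on the now-structured data, assuming records were
# collected into a list of dicts during extraction.
import pandas as pd

records = [
    {"name": "Jane Doe", "employer": "MIT Sloan", "n_children": 2},
    {"name": "Raj Patel", "employer": "NBER", "n_children": 1},
]

df = pd.DataFrame(records)
print(df.groupby("employer")["n_children"].mean())  # an ordinary groupby
```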
What's possible here,
extraction-wise,
is changing rapidly.
This is like maybe 20 "tokens" (more on 'token' definition later)
New models can handle enormous context windows
The long context: What 1 million tokens gets you
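One way to build intuition for token counts is to count them directly. A sketch using the tiktoken package, which (in recent releases) maps gpt-4o to the o200k_base encoding:

```python
# Rough token counting with tiktoken; exact counts depend on the encoding.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = (
    "Recent advances in AI, properly harnessed, can help our "
    "data-poor decision-making."
)
print(len(enc.encode(text)))  # roughly 15-20 tokens for a sentence like this
```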
Production-versus-analysis cost comparison
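A back-of-the-envelope version of the cost math, with placeholder numbers; actual per-token prices change often, so treat the rate below as purely illustrative:

```python
# Illustrative cost arithmetic; the price is a hypothetical placeholder,
# not a quoted rate for any particular model.
price_per_million_input_tokens = 2.50   # assumed USD rate
n_documents = 5_000                     # e.g., the 5,000 bio texts
tokens_per_document = 500               # assumed average length

total_tokens = n_documents * tokens_per_document        # 2,500,000 tokens
cost = total_tokens / 1_000_000 * price_per_million_input_tokens
print(f"${cost:.2f}")  # $6.25 under these assumed numbers
```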
"Classification":
Saying what something is
"Do they seem/happy/excited or angry?"
"Do they seem/happy/excited or angry?"
Note that unlike the other questions, the answer is not "here" explicitly---we must make an inference.
This joke has become (badly) dated. The "is it a bird" task is now trivially easy.
Let's send the
cartoon to GPT-4o
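A sketch of what sending an image to GPT-4o looks like with the OpenAI Python client; the image URL is a placeholder:

```python
# Sending an image plus a question to GPT-4o; the URL below is hypothetical.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is there a bird in this cartoon?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cartoon.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```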
Let's do a simpler classification task.
Imagine we have unstructured data. And we have examples of "labels" for the data.
The "old" (say 3 years ago) way
would be to train a classifier with examples.
Training an ML classifier
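The "old" way in miniature: fit a supervised text classifier on labeled examples with scikit-learn. The tiny dataset here is made up for illustration:

```python
# TF-IDF features plus logistic regression: a classic text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Where is my refund?",
    "Love the new feature!",
    "My order never arrived",
    "Great support, thanks",
]
labels = ["complaint", "praise", "complaint", "praise"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["The package is three weeks late"]))
# likely ['complaint'] given the training examples
```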
Humans, show us what you're made of:
Why do we do well?
Now: In-context learning with a frontier model (GPT-4o)
Now: In-context learning
The very limited training data gets put directly in the context window
Now: In-context learning
I give it the valid output labels to keep the model generation structured
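Concretely, a minimal version of this pattern might look like the following; the example messages and labels are made up for illustration:

```python
# In-context learning: the labeled examples go straight into the prompt,
# and the valid labels are stated explicitly to keep the output structured.
from openai import OpenAI

client = OpenAI()
few_shot_prompt = """Classify each message as exactly one of: complaint, praise.

Message: "Where is my refund?" -> complaint
Message: "Love the new feature!" -> praise
Message: "The package is three weeks late" ->"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": few_shot_prompt}],
    max_tokens=5,  # the answer is a single label
)
print(response.choices[0].message.content.strip())
```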
How does the model do this?!?
What does this mean, practically?
Managerial implications
A very simple example:
Transactional email
"Please, don't talk to us - we can't handle it!"
Not only is this not much money now,
very likely this is the most
expensive it will ever be.
Extended email example
EDSL is an open source Python package for simulating surveys, experiments and market research with AI agents and large language models.
It simplifies common tasks of LLM-based research:
Example: Analyzing unstructured data
Problem: Customer service emails need to be categorized and triaged.
EDSL workflow:
Constructing a question
QuestionList prompts the model to format its response as a list.
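A sketch of constructing such a question, assuming EDSL's documented question classes; the question wording is illustrative:

```python
# Constructing a QuestionList in EDSL (a sketch based on the package's
# question classes; the wording and question_name are made up).
from edsl import QuestionList

q_topics = QuestionList(
    question_name="topics",
    question_text="What topics are raised in this customer email? {{ email }}",
)
```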
We can parameterize questions to reuse them.
Adding data / context to a question
Match the {{ placeholder }} used in the question.
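For example, a Scenario whose key matches the {{ email }} placeholder above; the email text is made up:

```python
# Adding context via a Scenario; the dict key must match the placeholder.
from edsl import Scenario

scenario = Scenario(
    {"email": "My order arrived damaged and support never replied."}
)
```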
Running a question & inspecting results
We can use the response as answer options for a new question.
Adding data / context to a question
Now we create a separate scenario for each email.
Running a question & inspecting results
Results include fields about the questions, scenarios, agents, models, answers, raw responses, tokens generated, and costs.
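A sketch of the run-and-inspect step, assuming EDSL's by/run/select pattern:

```python
# Run the parameterized question against the scenario and pull out the
# answer field (a sketch of EDSL's documented workflow).
results = q_topics.by(scenario).run()
results.select("topics").print()
```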
Applying survey logic, rules & memory
We can combine questions in a survey and add logic/rules for how they are answered.
Running a survey & inspecting results
The follow-on question is (or is not) administered based on the response to the first question.
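A sketch of that pattern, assuming EDSL's QuestionYesNo class and an add_skip_rule method taking an expression over the prior question's name; the questions and rule are illustrative:

```python
# Combining questions into a Survey with a skip rule, so the follow-on
# question is only asked when relevant (all names here are illustrative).
from edsl import QuestionYesNo, QuestionFreeText, Survey

q1 = QuestionYesNo(
    question_name="is_complaint",
    question_text="Is this email a complaint? {{ email }}",
)
q2 = QuestionFreeText(
    question_name="remedy",
    question_text="What remedy does the customer want? {{ email }}",
)

# Skip the follow-on question when the first answer is 'No'.
survey = Survey(questions=[q1, q2]).add_skip_rule(q2, "is_complaint == 'No'")
results = survey.by(scenario).run()
```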
Concluding thoughts
Thanks!