1 of 68

Unlocking Insights:

Leveraging LLMs to Extract Structured Data from Unstructured Sources

John Horton

MIT Sloan & NBER

2 of 68

This session sounds boring.

3 of 68

Or, if not boring, per se,

DBA/CTO business, not executive business.

4 of 68

Why it is not boring and is executive business

  • Luck aside, the success of an organization largely reflects the quality of the decisions that organization makes
  • Short of a Steve Jobs-like genius-savant-CEO/founder, we have to make decisions with data
  • Often the data we have at hand is wildly inadequate for the decision we must make
  • How do we fix this?
    • Collect more data
    • Extract more insight from the data we already have

5 of 68

Key Idea for today:

Recent advances in AI, properly harnessed, can help our data-poor decision-making.

6 of 68

Let's jump in.

7 of 68

What is "unstructured" data?

  • Unstructured data is data that is not in a form that is as useful as it could be for our purposes. Examples:
    • Audio
    • Long/free form text
    • Videos
    • Images
    • PDFs
    • Mixed documents (e.g., research papers with tables, figures and text)
  • "Structured" data
    • We often mean tabular data, typically with some well-defined structure
      • Columns with meaningful headers
      • Rows are observations
    • Or something we could easily represent in a database

8 of 68

Unstructured biographical data as raw text

9 of 68

Jeopardy-like exercise: What questions does this data answer?

  • Where is our user from?
  • What is their age (partial)?
  • What is their marital status?
  • Do they have children?
  • Do they seem happy/excited or angry?

10 of 68

Jeopardy-like exercise: What questions does this data answer?

  • Where is our user from?
  • What is their age (partial)?
  • What is their marital status?
  • Do they have children?
  • Do they seem happy/excited or angry?

We (as humans) could easily answer these questions on a case-by-case basis. But a spreadsheet of 5,000 bio texts would be infeasible to work with.

11 of 68

Let's structure this unstructured

biographical data using AI (GPT-4o)
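
For concreteness, here is a minimal sketch of what that extraction call could look like with the openai Python package. The bio text, prompt wording, and field names are illustrative assumptions, not the actual prompt from the slides.

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    bio = "Hi! I'm John. I live in Cambridge, MA with my wife and our two kids."  # invented

    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # ask for valid JSON back
        messages=[
            {"role": "system", "content": (
                "Extract structured data from the user's bio as JSON with "
                "keys: name, residence, family (spouse, kids)."
            )},
            {"role": "user", "content": bio},
        ],
    )

    data = json.loads(response.choices[0].message.content)
    print(data)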

12 of 68

13 of 68

This other text is from an open source Python package that makes it easier to work with LLMs.

14 of 68

Note this placeholder-like syntax: this is where we can "pass in" the context from earlier.

15 of 68

16 of 68

Note the "key:value" pair. The 'key' here is "name"; it has an associated value, namely, well, my name.

17 of 68

It captured a 'nested' relationship within the 'family' structure: more keys and values, logically grouped.

18 of 68

The structure can also handle more than one value for a given key (in this case, multiple children).
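
Putting the last few slides together, the extracted record could look something like this (all names and values are invented for illustration):

    data = {
        "name": "John",                  # a key:value pair
        "residence": "Cambridge, MA",
        "family": {                      # a nested structure
            "spouse": "Jane",            # hypothetical
            "kids": ["Alice", "Bob"],    # multiple values for one key
        },
    }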

19 of 68

With our structured data, we could use "normal" approaches to do analysis (sketched below)

  • Where is our user from? → data['residence']
  • What is their age (partial)? → 'Adult' if data['family']['kids'] is not None
  • What is their marital status? → data['family']['spouse'] is not None
  • Do they have children? → len(data['family']['kids'])>0
  • Do they seem happy/excited or angry?
    • More on this one later
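
A minimal sketch of that analysis in Python, assuming each bio has been extracted into a dict like the one above:

    records = [data]  # imagine 5,000 of these, one per bio text

    for r in records:
        residence = r["residence"]                      # where they are from
        is_married = r["family"]["spouse"] is not None  # marital status
        has_kids = len(r["family"]["kids"]) > 0         # children?
        print(residence, is_married, has_kids)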

20 of 68

What's possible here,

extraction-wise,

is changing rapidly.

21 of 68

This is maybe 20 "tokens" (more on the definition of 'token' later)

22 of 68

New models can handle enormous context windows

23 of 68

The long context: What 1 million tokens gets you

24 of 68

Production-versus-analysis cost comparison

  • Using the full 1M tokens with an advanced model might cost $3
    • This cost will surely fall, but still might be so high as to completely obliterate the unit economics if done as part of "production"
  • BUT…
    • For analytics work, this is trivial: your data scientists cost 100x this every hour (see the arithmetic below)
    • You can sample!
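
Back-of-the-envelope, using the (illustrative) $3-per-million-token price from above:

    price_per_token = 3 / 1_000_000  # illustrative $ per input token

    docs = 5_000            # hypothetical corpus size
    tokens_per_doc = 500    # hypothetical average document length
    sample = 500            # or analyze a 10% sample

    full_cost = docs * tokens_per_doc * price_per_token
    sample_cost = sample * tokens_per_doc * price_per_token
    print(full_cost, sample_cost)  # $7.50 vs. $0.75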

25 of 68

"Classification":

Saying what something is

26 of 68

"Do they seem/happy/excited or angry?"

27 of 68

"Do they seem/happy/excited or angry?"

Note that unlike the other questions, the answer is not "here" explicitly; we must make an inference.

28 of 68

29 of 68

30 of 68

31 of 68

32 of 68

This joke has become (badly) dated. The "is it a bird" task is now trivially easy.

33 of 68

Let's send the

cartoon to GPT-4o
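
A sketch of what sending an image looks like with the openai package; the image URL is a placeholder, not the actual cartoon.

    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Is this a photo of a bird?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cartoon.png"}},  # placeholder
            ],
        }],
    )
    print(response.choices[0].message.content)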

34 of 68

35 of 68

36 of 68

Let's do a simpler classification task.

37 of 68

Imagine we have unstructured data. And we have examples of "labels" for the data.

38 of 68

39 of 68

The "old" (say 3 years ago) way

would be to train a classifier with examples.

40 of 68

Training an ML classifier

  • We'd need a lot more examples (than 6)
  • Even then, what is the input?
    • "Massachusetts" is not a number.
    • We can split into 'tokens': https://gptforwork.com/tools/tokenizer (see the sketch below)
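
You can do the same split programmatically; a sketch with the tiktoken package (exact token boundaries depend on the model's encoding):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")  # assumes a recent tiktoken release
    tokens = enc.encode("Massachusetts")
    print(tokens)                             # a short list of integer token ids
    print([enc.decode([t]) for t in tokens])  # the text piece each id maps to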

41 of 68

42 of 68

Training an ML classifier

  • We'd need a lot more examples than, say, 6
  • Even then, what is the input?
    • "Massachusetts" is not a number.
    • We can split into 'tokens': https://gptforwork.com/tools/tokenizer
  • What if:
    • the classifier had seen tokens for "New" (from "New Zealand" → Boop)
    • and "Mexico" (from "Mexico" → Boop)
    • How would it do with "New Mexico"?
      • It really wants to say "Boop"
  • Would it ever generalize to unseen examples?
    • E.g., if I had every state but Wyoming and every country, how would even the best model do on "Wyoming"?

43 of 68

Humans, show us what you're made of:

44 of 68

Why do we do well?

45 of 68

Now: In-context learning with a frontier model (GPT-4o)

46 of 68

Now: In-context learning

The very limited training data gets put directly in the context window

47 of 68

Now: In-context learning

I give it the valid output labels to keep the model generation structured
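
A sketch of what such a prompt could look like. The slides show countries mapping to "Boop"; the "Beep" label for states is an assumption, and the wording is invented.

    from openai import OpenAI

    examples = [
        ("Massachusetts", "Beep"),
        ("New Zealand", "Boop"),
        ("Mexico", "Boop"),
    ]
    lines = ["Label each place with exactly one of: Beep, Boop.", ""]
    lines += [f"{place} --> {label}" for place, label in examples]
    lines += ["", "New Mexico -->"]
    prompt = "\n".join(lines)

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)  # ideally "Beep"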

48 of 68

49 of 68

50 of 68

How does the model do this?!?

  • This is a case where unpacking the "GPT" acronym is helpful
    • GPT = "Generative Pre-Trained Transformer"
    • "Pre-trained" on a *lot* of text
  • As such, the pre-trained model already "knows" a great deal.
    • It knows what US States are
    • It knows what countries are
    • It knows what arrows ("-->") mean
    • It "understands" instructions
  • It can use all that information to do zero/one-shot learning
    • Why, exactly, it can do this is not totally clear

51 of 68

What does this mean, practically?

  • Using a language model on a one-shot classification task might work extremely well right "out of the box"
  • You can also "fine-tune" a model with your own examples
    • Start with a base model and give it your examples (data format sketched below)
  • If you use the (relatively expensive) GPT approach for classification, you eventually create a very good quasi-synthetic dataset to use
    • It's like you have many examples of labeled data you can use to train a cheaper model
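
For the fine-tuning route, a sketch of the chat-format JSONL that OpenAI's fine-tuning endpoint expects; the labels here would come from your earlier GPT runs (example content invented):

    import json

    examples = [
        {"messages": [
            {"role": "system", "content": "Label the place Beep (state) or Boop (country)."},
            {"role": "user", "content": "Massachusetts"},
            {"role": "assistant", "content": "Beep"},
        ]},
        # ...one entry per labeled example
    ]

    with open("train.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")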

52 of 68

Managerial implications

  • Given we have this very powerful classification/structuring tool
    • What data are we collecting now that we aren't making use of?
    • Given what we can now do, what data are we not collecting that we could/should collect?

53 of 68

A very simple example:

Transactional email

54 of 68

"Please, don't talk to us - we can't handle it!"

55 of 68

56 of 68

57 of 68

Not only is this not much money now,

very likely this is the most

expensive it will ever be.

58 of 68

Extended email example

EDSL is an open source Python package for simulating surveys, experiments and market research with AI agents and large language models.

It simplifies common tasks of LLM-based research:

  • Prompting LLMs to answer questions
  • Specifying the format of responses
  • Using AI agent personas to simulate responses for target audiences
  • Comparing & analyzing responses for multiple LLMs at once

59 of 68

Example: Analyzing unstructured data

Problem: Customer service emails need to be categorized and triaged.

EDSL workflow:

  1. Construct questions
  2. Add the emails to the questions
  3. Send the questions to a language model
  4. Analyze the responses in a dataset

60 of 68

Constructing a question

  • Select a question type: multiple choice, free text, linear scale, etc.
  • Format the question:

QuestionList prompts the model to format its response as a list.

We can parameterize questions to reuse them.
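
A sketch of this step with EDSL; the question name and wording are illustrative:

    from edsl import QuestionList

    # {{ email }} is a placeholder we will fill in with each email's text
    q = QuestionList(
        question_name="issues",
        question_text="List the issues raised in this email: {{ email }}",
    )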

61 of 68

Adding data / context to a question

  • Import data: PDF, CSV, doc, image, table, list, dictionary, etc.
  • Create “Scenario” dictionaries representing the data

The scenario keys match the {{ placeholder }} used in the question.
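
A sketch, with the scenario key matching the {{ email }} placeholder above (the email is invented):

    from edsl import Scenario

    # the key "email" matches the {{ email }} placeholder in the question
    scenario = Scenario({"email": "My invoice is wrong and I was double-charged."})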

62 of 68

Running a question & inspecting results

We can use the response as answer options for a new question.
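
A sketch of running the question and reusing the answers; method names follow EDSL's documented pattern, but treat the details as assumptions:

    from edsl import QuestionMultipleChoice

    results = q.by(scenario).run()
    results.select("issues").print()

    # reuse the listed issues as answer options for a follow-up question
    issues = results.select("issues").to_list()[0]
    q2 = QuestionMultipleChoice(
        question_name="top_issue",
        question_text="Which issue is most urgent in this email: {{ email }}?",
        question_options=issues,
    )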

63 of 68

Adding data / context to a question

Now we create a separate scenario for each email.
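
A sketch of scaling this up, one scenario per email (the emails are invented):

    from edsl import Scenario, ScenarioList

    emails = [
        "My invoice is wrong and I was double-charged.",
        "How do I reset my password?",
    ]
    scenarios = ScenarioList([Scenario({"email": e}) for e in emails])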

64 of 68

Running a question & inspecting results

Results include fields for the questions, scenarios, agents, models, answers, raw responses, tokens generated, and costs.

65 of 68

Applying survey logic, rules & memory

  • We can add logic and rules to a survey of questions: skip/stop, etc.

We can combine questions in a survey and add logic/rules for how they are answered.
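
A sketch of combining the two questions into a survey with a skip rule; the expression syntax follows EDSL's docs, but treat it as an assumption:

    from edsl import Survey

    survey = Survey(questions=[q, q2])

    # skip the follow-up when the first question found no issues (illustrative rule)
    survey = survey.add_skip_rule(q2, "issues == []")

    results = survey.by(scenarios).run()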

66 of 68

Running a survey & inspecting results

The follow-on question is (or is not) administered based on the response to the first question.

67 of 68

Concluding thoughts

  • We already collect a great deal of data that could be made more useful
  • The costs of doing this structuring have fallen dramatically
  • Not only can we get more out of what we had before, we can also collect more data
    • And we can potentially collect it in much more "natural" ways

68 of 68

Thanks!