1 of 17

2 of 17

DevTools for Large Language Models: Unlocking the Future of AI-Driven Applications

Diego M. Oppenheimer - @doppenhe

3 of 17

Foundation Models vs. Large Language Models

FMs:

  • Broad, general-purpose models that aim to capture a wide range of knowledge and capabilities (GPT-4, CLIP, DALL-E)

  • Billions of parameters

  • Designed to be adapted and fine-tuned to specific tasks using limited data

LLMs:

  • Specifically focused on language understanding and generation (GPT, BERT, LLaMA)

  • Trained on massive text datasets from multiple domains

  • Tasks include classification, generation, translation, summarization, and more

4 of 17

A quick walk down memory lane

Era                           | Model Size | Capabilities                            | Data
------------------------------|------------|-----------------------------------------|--------------------------------------------
Big Bang                      | No models  | Emptiness                               | Nothing
Self-Supervision              | 125-350M   | Vomit up Reddit                         | Small web/book dump
Who Could Get to 100B First?  | 1-100B     | Taskless text generation                | All the web
Instruction Tuned and Massive | 10-200+B   | Task generality and listens to feedback | Heavily curated and labeled web-scale data (probably cost billions)

As size and data quality increase, you get more generalization and in-context behaviors, but at higher cost

5 of 17

Entering the “Holy $#A!” phase

Early stages of development around new platforms tend to produce simple wrappers:

  • Microprocessor -> single-board computers

  • Operating system -> wrappers on utilities

  • Internet -> wrappers on Unix and network utilities

  • GenAI today -> wrappers on LLMs

6 of 17

Thriving developer ecosystem

7 of 17

The Development Process with LLMs

8 of 17

The Development Process with LLMs

GenAI today -> wrappers on LLMs

9 of 17

Orchestration, Experimentation and Prompting Tools

  • LLMs' “APIs” are natural language in the form of prompts

  • Mastering the API requires tinkering and experimenting with single and chained prompts

  • Various tools have emerged that provide:
    • connections to data sources,
    • indices,
    • coordination of chained calls, and
    • other core abstractions

*not a complete list
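The chaining these tools coordinate can be sketched in a few lines. In this sketch, `call_llm` is a hypothetical stand-in for any provider's completion API (here it is just a stub), and the prompt templates are illustrative; the point is only that each call's output feeds the next call's prompt:

```python
# Minimal sketch of a prompt chain: the output of one LLM call becomes
# part of the next prompt. `call_llm` is a hypothetical stand-in for a
# real provider's completion API, stubbed here so the sketch runs.
def call_llm(prompt: str) -> str:
    # A real implementation would call a hosted model here.
    return f"<response to: {prompt!r}>"

def chain(steps, initial_input: str) -> str:
    """Run a list of prompt templates, feeding each output forward."""
    text = initial_input
    for template in steps:
        text = call_llm(template.format(input=text))
    return text

result = chain(
    ["Summarize the following text: {input}",
     "List three follow-up questions about: {input}"],
    "LLMs are trained on massive text corpora...",
)
```

Orchestration frameworks add data connectors, retries, and richer abstractions on top of this same feed-forward pattern.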

10 of 17

Knowledge Retrieval and Vector Databases

  • Provide LLMs with contextual knowledge

  • “Memory”

  • Vector databases provide:
    • efficient vector similarity searches (retrieval)
    • efficient storage of up to billions of embeddings
    • efficient indexing capabilities

*not a complete list
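The retrieval step can be illustrated with a brute-force cosine-similarity search. This is only a sketch: the 3-dimensional "embeddings" and the stored keys are toy values, and real vector databases replace the linear scan with approximate nearest-neighbor indices to scale to billions of vectors:

```python
# Sketch of the retrieval step a vector database performs: store
# embeddings, then return the stored items most similar to a query
# vector. Brute-force scan for illustration; real systems use
# approximate nearest-neighbor indices. Vectors are toy values.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm

store = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
    "privacy notice": [0.0, 0.2, 0.9],
}

def retrieve(query_vec, k=2):
    """Return the k stored keys most similar to the query vector."""
    ranked = sorted(store, key=lambda key: cosine(store[key], query_vec),
                    reverse=True)
    return ranked[:k]

context = retrieve([0.85, 0.2, 0.05])  # → ["refund policy", "shipping times"]
```

The retrieved keys would then be stuffed into the prompt as contextual "memory" for the LLM.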

11 of 17

Building V2 of LLM Features - Fine-Tuning Language Models

*not a complete list

  • Goal: make models
    • More accurate
    • Faster and cheaper to run

  • Quality of labels is more important than quantity
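The label-quality point shows up concretely in dataset preparation. Below is a minimal sketch of assembling a supervised fine-tuning set as JSONL with a validation pass that drops bad examples; the prompt/completion schema is an assumption (providers differ), and the examples are invented:

```python
# Sketch of preparing a fine-tuning dataset as JSONL. The
# prompt/completion schema is an assumption; providers differ.
# A simple validation pass drops low-quality (empty) labels
# before anything is written.
import json

examples = [
    {"prompt": "Classify sentiment: 'Great product!'", "completion": "positive"},
    {"prompt": "Classify sentiment: 'Never buying again.'", "completion": "negative"},
    {"prompt": "Classify sentiment: 'It arrived.'", "completion": ""},  # bad: empty label
]

def is_valid(ex) -> bool:
    """Keep only examples with a non-empty prompt and completion."""
    return bool(ex["prompt"].strip()) and bool(ex["completion"].strip())

clean = [ex for ex in examples if is_valid(ex)]
jsonl = "\n".join(json.dumps(ex) for ex in clean)  # one JSON object per line
```

Even this trivial filter reflects the slide's claim: a smaller, cleaner set generally fine-tunes better than a larger, noisier one.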

12 of 17

Monitoring, Observability and Testing

*not a complete list

  • LLM Performance: Unique challenge; assess quality via user interactions.

  • A/B Testing: Evaluate LLM features via product analytics (full workflow)

  • Eye Test vs. HELM: as OSS gains traction, comparison frameworks will become more critical.

  • Performance Impact: Affects UX; guides model selection and fine-tuning.
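The A/B-testing bullet can be sketched as deterministic bucketing plus outcome logging. The variant names and the thumbs-up signal are illustrative assumptions; the key property is that hashing the user ID gives each user a stable variant across sessions:

```python
# Sketch of A/B-testing two model variants behind one feature.
# Users are hashed deterministically into buckets (stable
# assignment), and per-variant outcomes are tallied for later
# analysis. Variant names and feedback signal are illustrative.
import hashlib
from collections import Counter

VARIANTS = ["model-v1", "model-v2-finetuned"]

def assign_variant(user_id: str) -> str:
    """Stable assignment: the same user always gets the same variant."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % len(VARIANTS)
    return VARIANTS[bucket]

outcomes = Counter()

def record(user_id: str, thumbs_up: bool) -> None:
    """Log one user interaction against the user's assigned variant."""
    outcomes[(assign_variant(user_id), thumbs_up)] += 1

for uid, ok in [("alice", True), ("bob", False), ("alice", True)]:
    record(uid, ok)
```

In a real product the tally would flow into the analytics pipeline that evaluates the full workflow, not just the model call.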

13 of 17

Testing, Assurance and Guardrails

*not a complete list

  • LLMs can generate plausible but incorrect information, which poses risks for ‘low-affordability’ use cases (those that cannot afford errors), like medical diagnosis or financial decisions.

  • Guardrails are needed to ensure safety, accuracy, and reliability

  • Tools that allow users to define rules, schemas, and heuristics for LLM outputs will be crucial for building trust in these systems
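One simple form of such a guardrail is schema validation on the model's output. The sketch below assumes the application asked the model for a JSON object with a hand-written `diagnosis`/`confidence` schema (both names invented for illustration); anything that fails to parse or match is rejected rather than shown to the user:

```python
# Sketch of an output guardrail: require the model's answer to parse
# as JSON and match a small hand-written schema before it reaches the
# user; otherwise reject it. Real guardrail tools add retries,
# re-prompting, and richer rule languages. Field names are invented.
import json

SCHEMA = {"diagnosis": str, "confidence": float}

def guard(raw_output: str):
    """Return the parsed output if it matches SCHEMA, else None."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if set(data) != set(SCHEMA):
        return None
    if not all(isinstance(data[k], t) for k, t in SCHEMA.items()):
        return None
    return data

ok = guard('{"diagnosis": "flu", "confidence": 0.72}')
bad = guard('I think it is probably flu?')  # free text → rejected
```

Rejections can then trigger a re-prompt or a safe fallback response instead of surfacing unverified output.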

14 of 17

Some future predictions

* Iteration cycles will define winning developer experiences

* Larger models, even more access, and more powerful wrappers

15 of 17

* GPT-You:

MLOps tooling evolves to enable “personalized” FMs, trained on your own data and workflows:

  • Data is the most durable moat
  • Last mile is where the value is generated

16 of 17

Thank you

Diego Oppenheimer

doppenheimer

@doppenhe

17 of 17

Credits

David Hershey

Laurel Orr

Matt Turk