2 of 27

Emerging Stack for LLM-powered Products

What does it take to go from PoC to Production

Kuldeep Yadav, PhD

https://www.linkedin.com/in/kyadav/

Real World

AI Bootcamp

3 of 27

Agenda

Why should you think about the LLM stack?

Lifecycle of an LLM product

Design patterns to take LLM products from PoC to Production

Key libraries/platforms that you need

Infrastructure/tooling that is yet to be built


4 of 27

What is the most important ingredient of a successful LLM-powered product?

Slido: #3067587


5 of 27

A few years back …

Now…

Everyone has access to the same model

How do you differentiate?


6 of 27

Why is there a different stack for LLMs vs. ML?


7 of 27

Typical ML Pipeline

Weeks

Weeks

Weeks

Infrequent (months)

Secret Sauce


8 of 27

LLM Lifecycle

Days

Days

Days

Very Frequent

What is the secret sauce?


9 of 27

User Expectations

Ref: https://www.sh-reya.com/blog/ai-engineering-short/


10 of 27

Typical Development Lifecycle with LLMs

Scope

Does your product/feature really need LLMs?

Build

Iterate to design, develop, and test the product

Deploy

Deploy it for real users; ensure system reliability; plan to scale

Monitor

Monitor performance and availability closely

Iterate to improve

To ask the right question is already half the solution of a problem – Carl Jung


11 of 27

Scope: Key Questions to #ASK

Scope

“Testing the Waters”

Does this task/feature/product need LLMs?

What are the business metrics that it will impact?

Where does this fit in the user journey?

How will I present the output to the users?


12 of 27

Scope: Things to Do

  • Create a dataset of at least 10 examples with input and output pairs

  • Write prompt and generate the LLM output for these examples

  • Compare the output across paid (GPT, Gemini, Claude, etc.) and open-source LLMs (e.g., Llama)

  • Evaluate the initial effort/complexity of prompting techniques:
    • Breaking down the prompts
    • In-context examples
    • Single-shot vs Agentic

  • Does your product follow a well-established design?
    • Semantic Search/Retrieval-augmented Generation (RAG)
    • Text to SQL
    • Agents
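The "Scope" steps above can be sketched in a few lines: run a small input/output dataset through one prompt template and collect outputs per model for side-by-side comparison. Here `call_llm` is a hypothetical stand-in for a real provider SDK call (OpenAI, Gemini, Claude, a local Llama server), and the dataset/model names are illustrative:

```python
# Run a small benchmark dataset through a prompt template across models.
DATASET = [
    {"input": "Summarize: LLM stacks evolve quickly.", "expected": "LLM stacks change fast."},
    {"input": "Summarize: Evaluation should be continuous.", "expected": "Evaluate continuously."},
    # ... at least 10 such pairs in practice
]

PROMPT_TEMPLATE = "You are a concise assistant.\n\nTask: {input}\nAnswer:"

def call_llm(model: str, prompt: str) -> str:
    # Placeholder: replace with the provider's chat/completions call.
    return f"[{model}] response to: {prompt.splitlines()[-1]}"

def compare_models(models, dataset):
    """Return {example_index: {model: output}} for manual or automatic review."""
    results = {}
    for i, example in enumerate(dataset):
        prompt = PROMPT_TEMPLATE.format(input=example["input"])
        results[i] = {m: call_llm(m, prompt) for m in models}
    return results

results = compare_models(["gpt-4o", "claude-3", "llama-3"], DATASET)
```

Even this toy harness makes the scoping questions concrete: the same table of outputs tells you whether the task needs a frontier model, a cheaper one, or no LLM at all.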


13 of 27

What are agents really?


14 of 27

Scope: Tools to Use

ChainForge (Open-source, easy to install)

Athina IDE (SaaS, easy to try)

Several other tools: LangChain, Jupyter Notebooks, ChatGPT, and so on….


15 of 27

What are the common mistakes that people make in scoping?

Slido: #3067587


16 of 27

Build: Making it real

  • Continue asking yourself questions

  • How do I break my business metrics down into "evaluation metrics"? Are my evaluation metrics measurable?

  • Does my workflow need tool calling?

  • What latency does my application need to meet?

  • How do I build the product UI/UX to handle hallucinations and errors?

  • What are the high-level cost estimates?
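A high-level cost estimate is simple arithmetic over token counts and per-token prices; the prices and traffic numbers below are illustrative assumptions, not current list prices:

```python
# Back-of-the-envelope monthly cost estimate for an LLM feature.
def monthly_cost(requests_per_day, in_tokens, out_tokens,
                 price_in_per_m, price_out_per_m, days=30):
    # Price inputs are dollars per million tokens.
    per_request = (in_tokens * price_in_per_m + out_tokens * price_out_per_m) / 1_000_000
    return requests_per_day * days * per_request

# e.g. 10k requests/day, 1.5k prompt tokens, 400 completion tokens,
# at an assumed $2.50 / $10.00 per million input/output tokens:
cost = monthly_cost(10_000, 1_500, 400, 2.50, 10.00)
print(f"~${cost:,.0f}/month")  # prints ~$2,325/month
```

Running the same formula per candidate model is often enough to rule options in or out before any serious build work.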


17 of 27

Build: Benchmarking Datasets

  • Create a data annotation pipeline; the quality of data collection/calibration depends on how easy it is for users to rate

LabelStudio (Open-source, easy to use templates)
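The annotation pipeline can be sketched as a minimal record format written out as JSONL; the field names here are illustrative, not a Label Studio schema:

```python
import json

def make_record(example_id, model_input, model_output, rating, annotator):
    # One benchmarking-dataset row: what the model saw, what it produced,
    # and how a human rated it.
    return {
        "id": example_id,
        "input": model_input,
        "output": model_output,
        "rating": rating,        # e.g. 1-5; kept simple so raters can answer fast
        "annotator": annotator,
    }

records = [
    make_record(1, "Summarize X", "X summary", 4, "alice"),
    make_record(2, "Summarize Y", "Y summary", 2, "bob"),
]
jsonl = "\n".join(json.dumps(r) for r in records)
```

Keeping the rating dimension this small is a deliberate choice: the easier the question, the more consistent (and calibratable) the labels you collect.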


18 of 27

Build: Workflow Orchestration

https://www.techtarget.com/searchenterpriseai/definition/LangChain

  • Complex LLM applications are interconnected “graphs” of LLM calls and require careful orchestration
  • Pre-built modules and workflow capabilities for your task
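The "graph of LLM calls" idea can be shown with three stubbed steps wired together; real orchestrators (LangChain, LangGraph) add retries, streaming, and tracing on top of this skeleton, and the step functions here are stand-ins:

```python
def retrieve(query):
    return ["doc about " + query]          # stand-in for a retriever

def draft(query, docs):
    # Stand-in for an LLM call that writes a grounded answer.
    return f"Answer to '{query}' using {len(docs)} doc(s)"

def critique(answer):
    return answer + " [checked]"           # stand-in for a second LLM pass

def run_pipeline(query):
    docs = retrieve(query)                 # node 1
    answer = draft(query, docs)            # node 2 depends on node 1's output
    return critique(answer)                # node 3 depends on node 2's output

result = run_pipeline("LLM stacks")
```

Each node's output feeds the next; orchestration frameworks exist largely so you don't hand-roll error handling and observability around every such edge.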


19 of 27

Build: Workflow Orchestrators

  • Pick a workflow orchestrator well known for the task at hand (e.g., LangGraph for agents, LlamaIndex for RAG)

  • Examples: LangChain, LangGraph, LlamaIndex, Flowise


20 of 27

Build: Prompting

  • New prompting paradigms keep emerging; keep yourself updated (follow social media)
    • Chain-of-thought
    • Reflection
    • Read it again
  • Iterate, iterate, iterate on the evaluation datasets to see what works
    • Single-shot
    • In-context learning
    • Agentic
    • Or do you need fine-tuning?
  • Use tools for storing different prompt versions, experiment history, and evaluation results (LangChain, PromptHub, Athina)
  • Prompt optimization to find the best prompt (DSPy, AdalFlow)
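Prompt version tracking boils down to content-addressing each prompt and linking it to the evaluation scores it produced, so "what changed, and did it help?" is always answerable. A minimal sketch (tools like PromptHub or LangSmith provide this with much more; the class and metric names here are invented for illustration):

```python
import hashlib

class PromptRegistry:
    def __init__(self):
        self.versions = {}    # version hash -> prompt text
        self.results = []     # (version hash, metric name, score)

    def register(self, prompt: str) -> str:
        # Content-address the prompt so identical text always gets the same id.
        h = hashlib.sha256(prompt.encode()).hexdigest()[:12]
        self.versions[h] = prompt
        return h

    def record(self, version: str, metric: str, score: float):
        self.results.append((version, metric, score))

    def best(self, metric: str) -> str:
        # Return the version hash with the highest score on this metric.
        scored = [(s, v) for v, m, s in self.results if m == metric]
        return max(scored)[1]

reg = PromptRegistry()
v1 = reg.register("Summarize the text.")
v2 = reg.register("Summarize the text in one sentence, citing the source.")
reg.record(v1, "accuracy", 0.62)
reg.record(v2, "accuracy", 0.78)
```

With scores attached to immutable versions, "iterate, iterate, iterate" becomes a comparison over a table rather than guesswork over a chat history.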


21 of 27

Build: Testing/Evaluating LLM Products


22 of 27

Build: Invest in creating a data flywheel

https://www.sh-reya.com/blog/ai-engineering-flywheel/

Continuous Development

Continuous Integration

Continuous Improvement


23 of 27

Build: Let's take a deeper dive into Evaluation

Evaluation sees the most activity in the LLM tooling space; there are 8–10 good SaaS providers, including LangSmith, Galileo, Braintrust, Arize, etc.

Unit Tests

    • Assertions using PyTest or other libraries

Model Evaluations

    • Specific Q&A
    • LLM as a Judge

Human Evaluations

    • Analyze the traces as you build
    • Keep building the training datasets

A/B Testing

    • Useful where inter-human subjectivity plays a major role
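The cheapest layer, unit tests, is just deterministic assertions on model output, runnable under pytest. A sketch in which `generate` is a hypothetical stand-in for your model call; the checks are the point:

```python
import json

def generate(prompt: str) -> str:
    # Stub output for illustration; a real call goes to your model here.
    return '{"sentiment": "positive"}'

def test_output_is_valid_json():
    out = generate("Classify: 'great product'")
    parsed = json.loads(out)               # raises, failing the test, on malformed JSON
    assert "sentiment" in parsed

def test_sentiment_in_allowed_labels():
    parsed = json.loads(generate("Classify: 'great product'"))
    assert parsed["sentiment"] in {"positive", "negative", "neutral"}
```

Structural checks like these catch format regressions instantly and cost nothing, which is why they sit below model evaluations and human review in the stack.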


24 of 27

Methodical evaluation is a major difference between mediocre and great apps!


25 of 27

Deployment and Monitoring

  • Use caching to save costs

  • Continuous observability
    • Observe costs, failure rates, new data patterns
    • Keep track of key performance metrics
    • Automatically flag drift across metrics

  • Evaluation is a constant across the whole lifecycle
    • Keep iterating on “Build” and adding new data points to the mix

  • Online guardrails to prevent attacks

Key Tools: PortKey, Galileo, Redis, GPTCache, Helicone
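Caching to save costs can be as simple as an exact-match cache keyed on (model, prompt); tools like GPTCache extend the same idea to semantic (embedding-based) matching. A sketch with an invented class name and a fake model call:

```python
import hashlib

class ResponseCache:
    def __init__(self):
        self.store = {}
        self.hits = 0

    def key(self, model, prompt):
        # Exact-match key; a semantic cache would embed the prompt instead.
        return hashlib.sha256(f"{model}|{prompt}".encode()).hexdigest()

    def get_or_call(self, model, prompt, call_fn):
        k = self.key(model, prompt)
        if k in self.store:
            self.hits += 1                      # cache hit: no tokens billed
        else:
            self.store[k] = call_fn(model, prompt)  # only pay for misses
        return self.store[k]

cache = ResponseCache()
fake_llm = lambda model, prompt: f"answer({prompt})"
a = cache.get_or_call("gpt-4o", "What is RAG?", fake_llm)
b = cache.get_or_call("gpt-4o", "What is RAG?", fake_llm)
```

The hit counter doubles as an observability signal: a falling hit rate is often the first sign of a new data pattern worth investigating.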


26 of 27

Design Patterns of Successful Builders

Make data your friend

    • Collect, Analyze, Annotate, Breathe it
    • Invest in data pipelines and continuous feedback generation

Focus deeply on your user, workflow, domain, and the use-case

    • Everyone else has access to the same model

Evaluation should be omni-present in your entire lifecycle

    • Do not make product testing an afterthought especially with LLMs

Iterate quickly

    • Speed of experimentation and shipping is the new moat

Observability

    • You cannot take your eyes off it even for a week

Tools and Stack

    • They are great, but not a replacement for a sound methodology and approach


27 of 27

Thank You!
