How to Use Your Development Data to Make LLMs Code Like You and Your Team
Tyler Dunn, Co-founder & CEO of Continue
Continue is on a mission to make building software feel like making music
Continue is a modular, open-source Copilot alternative
It’s built as a reusable set of components that enable developers to create their own copilot
First, why do I want to make LLMs code like me and my team?
As developers, we want to experience flow state
Getting stuck disrupts our flow state
This is why so many of us are excited about software development copilots
But bad / wrong suggestions disrupt flow state too
Okay, but what is development data?
Dev data = how you build software
Data on the stuff that happens in between Git commits
Created as a by-product of using LLMs while coding
How to use your development data
Step 1
Collect your dev data and look at it
Step 2
Improve the compound AI system
Step 3
Improve the Large Language Models (LLMs)
Collect your development data and look at it
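"Looking at it" can be as simple as loading the log your copilot writes and computing an acceptance rate. A minimal sketch, assuming dev data arrives as JSONL records with `prompt`, `completion`, and `accepted` fields (these field names are illustrative, not Continue's actual schema):

```python
import json

# Hypothetical dev-data records, modeled on the kind of JSONL log a
# copilot might emit for each tab-autocomplete event (field names are
# assumptions, not an exact schema).
sample_log = [
    '{"prompt": "def add(a, b):", "completion": "return a + b", "accepted": true}',
    '{"prompt": "def sub(a, b):", "completion": "return a - b", "accepted": true}',
    '{"prompt": "class Foo:", "completion": "pass  # TODO", "accepted": false}',
]

records = [json.loads(line) for line in sample_log]
accepted = [r for r in records if r["accepted"]]
rate = len(accepted) / len(records)
print(f"acceptance rate: {rate:.0%}")  # a first, crude look at your dev data
```

Even this one number, tracked over time, tells you whether changes to your setup are helping or hurting.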
How to use your development data
Step 1
Collect your dev data and look at it
Step 2
Improve the compound AI system
Step 3
Improve the Large Language Models (LLMs)
Improve the compound AI system
Software dev copilots are compound AI systems
Software development AI systems today include many components
Provide clear and comprehensive instructions
vs.
Add a system message with instructions that should always be followed
vs.
Automatically filter for obviously bad suggestions and ask for a new suggestion
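The filter-and-retry idea above can be sketched in a few lines. This is an illustrative heuristic, not Continue's implementation; `generate` stands in for any LLM call:

```python
def looks_bad(completion: str) -> bool:
    """Cheap heuristics for obviously bad suggestions (illustrative only)."""
    stripped = completion.strip()
    if not stripped:
        return True                       # empty suggestion
    lines = stripped.splitlines()
    if len(lines) >= 3 and len(set(lines)) == 1:
        return True                       # model stuck in a repetition loop
    return False

def suggest_with_retry(generate, prompt: str, max_tries: int = 2) -> str:
    """Ask `generate` again when the first suggestion fails the filter."""
    suggestion = ""
    for _ in range(max_tries):
        suggestion = generate(prompt)
        if not looks_bad(suggestion):
            return suggestion
    return suggestion  # fall back to the last attempt

# Demo with a stub "model" that fails once, then succeeds.
attempts = iter(["", "return a + b"])
print(suggest_with_retry(lambda p: next(attempts), "def add(a, b):"))
```

The point is that the system around the model can catch failure modes the model itself cannot.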
Examples
Improve how context from your codebase + software development lifecycle is retrieved and used
Select the right model for the job
“Chat” model
“Tab” model
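Picking the right model for the job can start as a simple routing table: long-form chat goes to a larger model where quality matters most, while tab autocomplete goes to a smaller, low-latency one. A minimal sketch; the model names here are placeholders, not recommendations:

```python
# Route each request kind to the model best suited for it.
MODELS = {
    "chat": "large-chat-model",          # quality matters more than latency
    "tab": "small-autocomplete-model",   # must respond in tens of milliseconds
}

def pick_model(request_kind: str) -> str:
    """Return the model name configured for this kind of request."""
    return MODELS[request_kind]

print(pick_model("chat"))
print(pick_model("tab"))
```

Because the routing logic lives in the system rather than the model, you can swap either model independently as better options appear.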
How to use your development data
Step 1
Collect your dev data and look at it
Step 2
Improve the compound AI system
Step 3
Improve the Large Language Models (LLMs)
Improve the LLMs
The ideal data for an LLM
By-product of using LLMs → close to ideal data
When you use LLMs while coding, you create development data that shows how you actually build software
Google is already using their development data
So what development data is helpful now?
Examples
Use fine-tuning to improve existing LLMs
dltHub fine-tuned StarCoder 2 on their codebase, docs, accepted tab autocomplete data, etc.
Domain-specific instructions + hundreds of GPU hours
GigaML is fine-tuning StarCoder 2 on accepted tab autocomplete data
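Accepted tab-autocomplete events map naturally onto the fill-in-the-middle (FIM) format that StarCoder-family models are trained on. A sketch of that conversion, assuming a hypothetical record shape with `prefix`, `suffix`, `completion`, and `accepted` fields (not an exact schema from either company):

```python
# Turn an accepted autocomplete event into a StarCoder-style FIM example,
# using the <fim_prefix>/<fim_suffix>/<fim_middle> special tokens.
def to_fim_example(record: dict) -> str:
    return (
        f"<fim_prefix>{record['prefix']}"
        f"<fim_suffix>{record['suffix']}"
        f"<fim_middle>{record['completion']}"
    )

record = {
    "prefix": "def add(a, b):\n    ",
    "suffix": "\n",
    "completion": "return a + b",
    "accepted": True,
}
if record["accepted"]:  # only train on suggestions developers kept
    print(to_fim_example(record))
```

Filtering to accepted suggestions is the key step: it turns raw usage logs into a labeled dataset of completions your team actually wanted.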
Use domain-adaptive continued pre-training to improve open-source LLMs
How Code Llama was created by Meta
How ChipNeMo was created by Nvidia
Billions of tokens of relevant company data + thousands of GPU hours
Pre-train your own LLM from scratch
OpenAI, MosaicML, Together, etc. will help you train your own custom model
Trillions of tokens of Internet data + company data + millions of GPU hours
Replit trained their own model
How to use your development data
Step 1
Collect your dev data and look at it
Step 2
Improve the compound AI system
Step 3
Improve the Large Language Models (LLMs)
TL;DR: Dev data can be used to automate even more
Thanks!
We are at the beginning of this journey :)
Lots more R&D to come!
We are hiring
Appendix