Open Large Language Models
for Code
Loubna Ben Allal
Machine Learning Engineer, Science team
How it started: GitHub Copilot in 2021
How it’s going: Over 1.7k open models trained on code
How did we get here?
Strong Instruction-tuned and base models
What you need to train (code) LLMs from scratch
From GPT 1 → 4
| Model | Dataset size (billion tokens) | Model size (billion parameters) |
|-------|-------------------------------|---------------------------------|
| GPT-1 | 1-2 | 0.11 |
| GPT-2 | 10-20 | 1.4 |
| GPT-3 | 300 | 175 |
| GPT-4 | ~10,000 | ~1,800 |
Compute: GPT-4 training cost ~$100M
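As a rough sanity check of these numbers, training compute can be approximated with the common C ≈ 6·N·D rule of thumb (N parameters, D training tokens); the figures below are the slide's rough, publicly reported estimates, not confirmed values.

```python
# Rough training-compute estimate with the common approximation C ≈ 6·N·D
# (N = parameters, D = training tokens). Token counts for GPT-1/2 are midpoints
# of the ranges on the slide; GPT-4 figures are rumored, not confirmed.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

models = {
    "GPT-1": (0.11e9, 1.5e9),
    "GPT-2": (1.4e9, 15e9),
    "GPT-3": (175e9, 300e9),
    "GPT-4": (1.8e12, 10e12),
}
for name, (n, d) in models.items():
    print(f"{name}: ~{training_flops(n, d):.1e} FLOPs")
# GPT-3 lands around 3e23 FLOPs and GPT-4 around 1e26 FLOPs,
# which is where the ~$100M compute bill comes from.
```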
Training Generative AI Models
Untrained Model → Pretrained "Base" Model → Supervised Finetuned (SFT) Model → RLHF/DPO (👍👎 human preference feedback) → Chat LLM (e.g. GPT-4)
Training Code LLMs
Untrained Model → Pretrained "Base" Model → Supervised Finetuned (SFT) Model → RLHF/DPO (👍👎 human preference feedback) → Chat LLM (e.g. GPT-4)
Instruction datasets for code: e.g. "write a function", "fix a bug", … (a sample format is sketched below)
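To make that last step concrete, here is a hedged sketch of what a single code-instruction sample might look like in the chat format commonly used for SFT; the field names and content are illustrative, not taken from a specific dataset.

```python
# Illustrative shape of one code-instruction training example (hypothetical content);
# chat-style SFT datasets typically store user/assistant turns like this.
sample = {
    "messages": [
        {"role": "user",
         "content": "Write a function that checks whether a string is a palindrome."},
        {"role": "assistant",
         "content": ("def is_palindrome(s: str) -> bool:\n"
                     "    s = ''.join(c.lower() for c in s if c.isalnum())\n"
                     "    return s == s[::-1]")},
    ]
}
```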
The Landscape of base open code LLMs
Others: CodeQwen from the Qwen team, CodeGen from Salesforce, StableCode from Stability AI…
The gradient of model releases:
- Closed model APIs: model weights not available
- Open model weights: no access to training data or code
- Fully open models: full access to model, code, and data
BigCode: open scientific collaboration
We are building LLMs for code in a collaborative way:
1100+ researchers, engineers, lawyers, and policy makers
Open & Responsible Research on LLMs
- Open-access datasets
- Open-access models
- Reproducible research
- Documentation
From SantaCoder to StarCoder2 🚀
- SantaCoder (Dec 2022)
- StarCoder (May 2023)
- StarCoder2 (Feb 2024)
The Stack: data collection
StarCoderData: 800 GB of code in 86 programming languages, with GitHub Issues, Jupyter Notebooks, and Git Commits
Where did the 6 TB (The Stack) go?
Data filtering
- Personally Identifiable Information (PII) removal (toy sketch below)
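The real pipeline uses a trained PII-detection model rather than simple patterns; the snippet below is only a toy regex-based sketch of the redaction idea, with illustrative placeholder names.

```python
import re

# Toy redaction sketch: mask emails and IPv4 addresses with placeholder tokens.
# The actual StarCoder pipeline uses a trained PII-detection model, not regexes,
# and covers more categories (names, keys, passwords, ...).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact(text: str) -> str:
    text = EMAIL.sub("<EMAIL>", text)
    return IPV4.sub("<IP_ADDRESS>", text)

print(redact("Contact jane.doe@example.com, server at 192.168.0.1"))
# -> Contact <EMAIL>, server at <IP_ADDRESS>
```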
StarCoder
Model size: 15B parameters
Context length: 8192 tokens
Infrastructure: 512 A100 GPUs
Training length: 1T tokens / 250k steps
Training time: 24 days
Best open LLM for code at the time of release!
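A quick back-of-the-envelope check, assuming the C ≈ 6·N·D approximation and A100 BF16 peak throughput, shows the slide's numbers are self-consistent; the utilization figure is an estimate, not a reported value.

```python
# Rough sanity check of the StarCoder training setup from the slide.
n_params, n_tokens = 15e9, 1e12          # 15B parameters, 1T tokens
gpus, days = 512, 24                     # 512 A100s for ~24 days
peak_flops_per_gpu = 312e12              # A100 BF16 peak, FLOP/s

compute_needed = 6 * n_params * n_tokens                    # ≈ 9.0e22 FLOPs
compute_available = gpus * peak_flops_per_gpu * days * 86400
utilization = compute_needed / compute_available
print(f"needed ≈ {compute_needed:.1e} FLOPs, implied utilization ≈ {utilization:.0%}")
# ≈ 27% of peak — a plausible efficiency for large-scale training.
```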
“smooth sailing”
The Stack v2
Data collection pipeline:
- Software Heritage → Raw Dataset: 67.5 TB of data
- License filtering + near-deduplication: 32.1 TB of data
- Language selection, filtering, preprocessing: 2.4 TB (775B tokens)
- Extra sources: GH Issues, PRs, and other high-quality data sources (140B tokens)
- Total: 915B tokens
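Near-deduplication in this kind of pipeline is typically MinHash-based; below is a minimal sketch using the datasketch library, with shingling and a threshold that are illustrative rather than the production settings.

```python
from datasketch import MinHash, MinHashLSH

# Minimal near-deduplication sketch with MinHash + LSH; the whitespace shingling
# and the 0.7 threshold are illustrative, not the settings used for The Stack.
def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

docs = {
    "a.py": "def add(a, b):\n    return a + b",
    "b.py": "def add(a, b):\n    return a + b  # duplicate",
    "c.py": "class Stack:\n    def push(self, x): ...",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)
kept = []
for name, text in docs.items():
    mh = minhash(text)
    if lsh.query(mh):          # near-duplicate of a document we already kept
        continue
    lsh.insert(name, mh)
    kept.append(name)
print(kept)                    # typically ['a.py', 'c.py']
```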
The Stack: data inspection + opt-out
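One way to inspect the data yourself is to stream a slice of The Stack from the Hub; a minimal sketch, assuming you have accepted the dataset's terms of use (the data_dir layout and field names follow the dataset card and may differ between versions).

```python
from datasets import load_dataset

# Stream a slice of The Stack for inspection without downloading the full dataset.
ds = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",   # one subdirectory per programming language
    split="train",
    streaming=True,
)

for sample in ds.take(3):
    print(sorted(sample.keys()))
    print(sample["content"][:200])
    print("-" * 40)
```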
StarCoder2
Model size: 15B, 7B, 3B
Context length: 16k tokens
Supports repository-level context
Trained on 4T+ tokens
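A minimal inference sketch with the smallest StarCoder2 checkpoint via transformers; the model id, precision, and generation settings are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal completion example; pick the checkpoint size and precision that fit your GPU.
checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```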
Tooling
- Auto-complete: VS Code extension, https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode
- Membership test / Dataset Search: https://huggingface.co/spaces/bigcode/search
Customize Code Models: Chat assistant
Instruction-tune a code model: mix different open chat and code datasets (sketch below). Demo: https://hf.co/spaces/HuggingFaceH4/starchat2-playground
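A hedged sketch of such an instruction-tuning run with TRL's SFTTrainer; the dataset, base model, and hyperparameters below are examples rather than the StarChat2 recipe, and SFTTrainer/SFTConfig arguments differ between TRL versions.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Example open chat dataset with a "messages" column (user/assistant turns).
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def to_text(example):
    # Flatten chat turns into plain text; real recipes use a proper chat template.
    return {"text": "\n".join(f"{m['role']}: {m['content']}" for m in example["messages"])}

dataset = dataset.map(to_text, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model="bigcode/starcoder2-3b",          # base code model to instruction-tune
    train_dataset=dataset,                   # expects a "text" column by default
    args=SFTConfig(
        output_dir="starcoder2-3b-chat",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
)
trainer.train()
```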
Customize Code Models: Code completion
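For completion-style customization (e.g. adapting a base model to your own codebase), a common lightweight approach is LoRA fine-tuning; the sketch below uses TRL + PEFT with an illustrative glob path and hyperparameters, not a recipe from the talk.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hedged sketch of adapting a base code model to your own code with LoRA.
# Note: the generic "text" loader yields one sample per line of code; real
# recipes chunk whole files and pack sequences before training.
dataset = load_dataset("text", data_files={"train": "my_repo/**/*.py"}, split="train")

trainer = SFTTrainer(
    model="bigcode/starcoder2-3b",
    train_dataset=dataset,                                    # "text" column with raw code
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    args=SFTConfig(output_dir="starcoder2-3b-my-code", num_train_epochs=1, bf16=True),
)
trainer.train()
```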
Leaderboards
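Code leaderboards typically report pass@1 on execution benchmarks such as HumanEval; for reference, this is the unbiased pass@k estimator from the HumanEval paper (the sample counts below are made up).

```python
import math

# Unbiased pass@k estimator: given n generated samples per problem of which
# c pass the unit tests, estimate the probability that at least one of k passes.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))  # ~0.877
```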
BigCode Ecosystem
- 📑 The Stack → community pretraining: StableCode, CodeGen 2.5, Replit-Code-3B, DeciCoder-1B
- 📑 The Stack → StarCoderBase (plus 1B / 3B / 7B variants)
  - more Python → StarCoder
  - more natural language → StarCoder+
  - instruction tuning → StarChat α, StarChat β
  - community fine-tuning → WizardCoder, PanGu-Coder 2, Defog-SQLCoder
  - instruction tuning with git commits → OctoCoder
Challenges of a fully open collaboration
Future Directions
Thank you!
Contact: loubna@huggingface.co