1 of 37

Open Large Language Models

for Code

Loubna Ben Allal

Machine Learning Engineer, Science team

2 of 37

How it started: GitHub Copilot in 2021

3 of 37

How it’s going: Over 1.7k open models trained on code

4 of 37

How did we get here?

5 of 37

Strong Instruction-tuned and base models

6 of 37

What you need to train (code) LLMs from scratch

7 of 37

From GPT-1 → GPT-4

Model  | Dataset size (billion tokens) | Model size (billion parameters) | Compute vs. previous
GPT-1  | 1-2                           | 0.11                            | -
GPT-2  | 10-20                         | 1.4                             | ~100x
GPT-3  | 300                           | 175                             | ~2000x
GPT-4  | ~10,000                       | ~1,800                          | ~300x

GPT-4 cost: ~$100M
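The compute jumps can be sanity-checked with the standard approximation that training compute scales as C ≈ 6·N·D FLOPs (N = parameters, D = training tokens). A minimal sketch using the table's figures (the GPT-4 row is an unconfirmed estimate):

```python
# Approximate training compute as C ≈ 6 * N * D FLOPs
# (N = parameters, D = training tokens). Figures from the table above.
models = {
    "GPT-1": (0.11e9, 1.5e9),     # midpoint of 1-2B tokens
    "GPT-2": (1.4e9, 15e9),       # midpoint of 10-20B tokens
    "GPT-3": (175e9, 300e9),
    "GPT-4": (1800e9, 10_000e9),  # unconfirmed estimates
}

prev = None
for name, (params, tokens) in models.items():
    flops = 6 * params * tokens
    note = f" (~{flops / prev:.0f}x previous)" if prev else ""
    print(f"{name}: ~{flops:.1e} training FLOPs{note}")
    prev = flops
# The ratios land near the slide's ~100x, ~2000x, and ~300x.
```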

8 of 37

Training Generative AI Models

Untrained Model → Pretrained “Base” Model → Supervised Fine-tuned (SFT) Model → RLHF/DPO (👍👎) → Chat LLM (e.g. GPT-4)

9 of 37

Training Code LLMs

Untrained Model → Pretrained “Base” Model → Supervised Fine-tuned (SFT) Model → RLHF/DPO (👍👎) → Chat LLM (e.g. GPT-4)

Instruction dataset for code: “write a function”, “solve a bug”, etc.
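For illustration, one hypothetical training example in the chat-messages format such code instruction datasets commonly use (field names and schema vary across datasets):

```python
# A hypothetical code-instruction example in the common "messages" format.
example = {
    "messages": [
        {"role": "user",
         "content": "Write a function that checks if a string is a palindrome."},
        {"role": "assistant",
         "content": "def is_palindrome(s: str) -> bool:\n"
                    "    s = s.lower()\n"
                    "    return s == s[::-1]"},
    ]
}
```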

10 of 37

The landscape of open base code LLMs

  • BigCode: StarCoder & StarCoder2
    • The Stack dataset
    • 3B, 7B, 15B sizes
    • StarChat2 (with the H4 team)
  • DeepSeek-Coder
    • DeepSeek-Coder-Instruct
    • 1B, 7B, 33B sizes
  • CodeLlama
    • CodeLlama-Instruct
    • 7B, 13B, 70B sizes

Others: CodeQwen from the Qwen team, CodeGen from Salesforce, StableCode from Stability AI…

11 of 37

The gradient of model releases

  • closed model APIs: model weights not available
  • open model weights: no access to training data or code
  • fully open model: full access to model/code/data

12 of 37

BigCode: open-scientific collaboration

We are building LLMs for code in a collaborative way:

  • Full data transparency
  • Open source processing and training code
  • Model weights released under a commercially friendly license

1100+ researchers, engineers, lawyers, and policy makers

13 of 37

Open & Responsible Research on LLMs

  • Open-access datasets
  • Open-access models
  • Reproducible research
  • Documentation

14 of 37

From SantaCoder to StarCoder2 🚀

SantaCoder

Dec 2022

  • 1.1B code generation model
  • 3 languages
  • 18% Python score
  • Transparent dataset
  • Open Access

StarCoder

May 2023

  • 15B code generation model
  • 80+ languages
  • 33% Python score
  • Transparent dataset
  • Open Access

StarCoder2

Feb 2024

  • 15B code generation model
  • 600+ languages
  • 46% Python score
  • Transparent dataset
  • Open Access

15 of 37

The Stack: data collection

16 of 37

StarCoderData

800 GB of code in 86 programming languages, with GitHub Issues, Jupyter Notebooks and Git Commits
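A minimal sketch of streaming one language subset of StarCoderData from the Hub without downloading all 800 GB. The dataset is gated (accept the terms and log in first), and the per-language data_dir layout is an assumption to verify on the dataset card:

```python
from datasets import load_dataset

# Stream the Python subset of StarCoderData (assumed data_dir layout).
ds = load_dataset("bigcode/starcoderdata", data_dir="python",
                  split="train", streaming=True)
print(next(iter(ds))["content"][:200])   # first characters of one source file
```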

17 of 37

Where did the 6 TB go?

18 of 37

Data filtering

  • Near-deduplication

  • Language selection & quality inspection

  • Decontamination

- Personal Identifiable Information (PII) removal
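A minimal sketch of MinHash-LSH near-deduplication, in the spirit of the BigCode pipeline (the real pipeline differs in shingling, thresholds, and scale; requires the datasketch package):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128, k: int = 5) -> MinHash:
    """MinHash over character k-gram shingles of the file content."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(1, len(text) - k + 1)):
        m.update(text[i:i + k].encode("utf-8"))
    return m

files = {
    "a.py": "def add(a, b):\n    return a + b\n",
    "b.py": "def add(a, b):\n    return a + b\n# end\n",  # near-duplicate of a.py
    "c.py": "print('hello, world')\n",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)
kept = []
for name, code in files.items():
    m = minhash(code)
    if not lsh.query(m):        # no sufficiently similar file indexed yet
        lsh.insert(name, m)
        kept.append(name)
print(kept)                     # b.py should be dropped as a near-duplicate
```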

19 of 37

StarCoder

Model size: 15B parameters

Context length: 8192 tokens

Infrastructure: 512 A100 GPUs

Training length: 1T tokens / 250k steps

Training time: 24 days

Best open LLM for code at the time of release!

“smooth sailing”
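A minimal sketch of sampling from the released model with transformers. The checkpoint is gated on the Hub (accept the license and log in first), and device_map="auto" requires the accelerate package:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```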

20 of 37

21 of 37

The Stack v2

Data collection: Software Heritage → Raw Dataset
  → license filtering → 67.5 TB of data
  → near-deduplication → 32.1 TB of data
  → language selection, filtering, preprocessing → 2.4 TB (775B tokens)
  + GH Issues, PRs, and other high-quality data sources (140B tokens)
  = 915B tokens

22 of 37

Extra sources

  • Jupyter notebooks: structured (code & markdown pairs) vs scripts
  • Kaggle notebooks
  • GitHub issues and pull requests
  • LHQ
  • Wikipedia, Arxiv, OpenWebMath

23 of 37

The Stack: data inspection + opt-out

24 of 37

The Stack: data inspection + opt-out

25 of 37

StarCoder2

Model size: 15B, 7B, 3B

Context length: 16k tokens

Supports repository-level context

Trained on 4T+ tokens
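A sketch of what a repository-level prompt can look like, assuming the <repo_name>/<file_sep> special tokens described in the StarCoder2 paper; verify against the released tokenizer's special tokens before relying on this:

```python
# Hypothetical repository-level prompt for StarCoder2 (assumed token format).
prompt = (
    "<repo_name>myorg/myrepo"          # hypothetical repo name
    "<file_sep>utils.py\n"
    "def slugify(text):\n"
    "    return text.lower().replace(' ', '-')\n"
    "<file_sep>app.py\n"
    "from utils import slugify\n\n"
    "def make_url(title):\n"           # model continues using utils.py context
)
```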

26 of 37

StarCoder2

27 of 37

Tooling

  • Auto-complete
  • Membership test

https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode

28 of 37

Dataset Search

https://huggingface.co/spaces/bigcode/search

29 of 37

Customize Code Models: Chat assistant

Instruction-tune a code model: mix different open chat and code datasets.

https://hf.co/spaces/HuggingFaceH4/starchat2-playground
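A minimal sketch with TRL's SFTTrainer, using one open chat dataset as a stand-in for the mix (StarChat2 combined several open chat and code datasets; the dataset choice and arguments here are illustrative, and the SFTTrainer API varies across TRL versions):

```python
from datasets import load_dataset
from trl import SFTTrainer

chat_data = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

trainer = SFTTrainer(
    model="bigcode/starcoder2-3b",   # base code model to instruction-tune
    train_dataset=chat_data,
)
trainer.train()
```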

30 of 37

Customize Code Models: Code completion
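Editor-style completion typically relies on fill-in-the-middle (FIM). A minimal sketch with a StarCoder-family checkpoint, using the <fim_prefix>/<fim_suffix>/<fim_middle> tokens from its tokenizer (other model families use different FIM token names):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="bigcode/starcoder2-3b")

# The model fills in the span between prefix and suffix after <fim_middle>.
prompt = (
    "<fim_prefix>def remove_whitespace(text):\n    return "
    "<fim_suffix>\n\nprint(remove_whitespace(' a b '))\n"
    "<fim_middle>"
)
print(generator(prompt, max_new_tokens=24)[0]["generated_text"])
```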

31 of 37

Leaderboards

32 of 37

Leaderboards

33 of 37

Leaderboards

34 of 37

BigCode Ecosystem

📑 The Stack
  • pretraining → StarCoderBase (plus StarCoderBase-1B, -3B, -7B)
    • more Python → StarCoder
    • more natural language → StarCoder+
    • instruction tuning → StarChat ⍺, StarChat β
    • instruction tuning with git commits → OctoCoder
    • community fine-tuning → WizardCoder, PanGu-Coder 2, Defog-SQLCoder
  • community pretraining → StableCode, CodeGen 2.5, Replit-Code-3B, DeciCoder-1B

35 of 37

Challenges of a fully open collaboration

  • decision making
    • decentralized decision making is more difficult
  • public scrutiny
    • everybody can check code and datasets and report issues
  • maintenance
    • public code base and datasets need to be kept up to date (e.g. opt-outs)
  • public timelines
    • other projects can adapt their timelines to yours, but not vice versa

36 of 37

Future Directions

  • High quality datasets for high and low resource languages

  • More data transparency and governance

  • Evaluation benchmarks & leaderboards

  • Smaller specialized models

37 of 37

Thank you!

Contact: loubna@huggingface.co