1 of 37

Open Large Language Models

for Code

Loubna Ben Allal

Machine Learning Engineer, Science team

2 of 37

How it started: GitHub Copilot in 2021

3 of 37

How it’s going: Over 1.7k open models trained on code

4 of 37

How did we get here?

5 of 37

Strong Instruction-tuned and base models

6 of 37

What you need to train (code) LLMs from scratch

7 of 37

From GPT-1 → GPT-4

Model  | Dataset size (billion tokens) | Model size (billion parameters) | Compute vs. previous
GPT-1  | 1-2                           | 0.11                            | -
GPT-2  | 10-20                         | 1.4                             | ~100x
GPT-3  | 300                           | 175                             | ~2000x
GPT-4  | ~10,000                       | ~1,800                          | ~300x

GPT-4 cost: ~$100M
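The compute jumps can be sanity-checked with the standard approximation that training compute scales as C ≈ 6·N·D FLOPs (N = parameters, D = training tokens). A minimal sketch using the table's figures (the GPT-4 row is an unconfirmed estimate):

```python
# Approximate training compute as C ≈ 6 * N * D FLOPs
# (N = parameters, D = training tokens). Figures from the table above.
models = {
    "GPT-1": (0.11e9, 1.5e9),     # midpoint of 1-2B tokens
    "GPT-2": (1.4e9, 15e9),       # midpoint of 10-20B tokens
    "GPT-3": (175e9, 300e9),
    "GPT-4": (1800e9, 10_000e9),  # unconfirmed estimates
}

prev = None
for name, (params, tokens) in models.items():
    flops = 6 * params * tokens
    note = f" (~{flops / prev:.0f}x previous)" if prev else ""
    print(f"{name}: ~{flops:.1e} training FLOPs{note}")
    prev = flops
# The ratios land near the slide's ~100x, ~2000x, and ~300x.
```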

8 of 37

Training Generative AI Models

Untrained Model → Pretrained “Base” Model → Supervised Fine-tuned (SFT) Model → RLHF/DPO (👍👎) → Chat LLM (e.g. GPT-4)

9 of 37

Training Code LLMs

Untrained Model → Pretrained “Base” Model → Supervised Fine-tuned (SFT) Model → RLHF/DPO (👍👎) → Chat LLM (e.g. GPT-4)

Instruction dataset for code: “write a function”, “solve a bug”, etc.
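For illustration, one hypothetical training example in the chat-messages format such code instruction datasets commonly use (field names and schema vary across datasets):

```python
# A hypothetical code-instruction example in the common "messages" format.
example = {
    "messages": [
        {"role": "user",
         "content": "Write a function that checks if a string is a palindrome."},
        {"role": "assistant",
         "content": "def is_palindrome(s: str) -> bool:\n"
                    "    s = s.lower()\n"
                    "    return s == s[::-1]"},
    ]
}
```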

10 of 37

The landscape of open base code LLMs

  • BigCode: StarCoder & StarCoder2
    • The Stack dataset
    • 3B, 7B, 15B sizes
    • StarChat2 (with the H4 team)
  • DeepSeek-Coder
    • DeepSeek-Coder-Instruct
    • 1B, 7B, 33B sizes
  • CodeLlama
    • CodeLlama-Instruct
    • 7B, 13B, 70B sizes

Others: CodeQwen from the Qwen team, CodeGen from Salesforce, StableCode from Stability AI…

11 of 37

The gradient of model releases

  • closed model APIs: model weights not available
  • open model weights: no access to training data or code
  • fully open model: full access to model/code/data

12 of 37

BigCode: open-scientific collaboration

We are building LLMs for code in a collaborative way:

  • Full data transparency
  • Open source processing and training code
  • Model weights released under a commercially friendly license

1100+ researchers, engineers, lawyers, and policy makers

13 of 37

Open & Responsible Research on LLMs

  • Open-access datasets
  • Open-access models
  • Reproducible research
  • Documentation

14 of 37

From SantaCoder to StarCoder2 🚀

SantaCoder

Dec 2022

  • 1.1B code generation model
  • 3 languages
  • 18% Python score
  • Transparent dataset
  • Open Access

StarCoder

May 2023

  • 15B code generation model
  • 80+ languages
  • 33% Python score
  • Transparent dataset
  • Open Access

StarCoder2

Feb 2024

  • 15B code generation model
  • 600+ languages
  • 46% Python score
  • Transparent dataset
  • Open Access

15 of 37

The Stack: data collection

16 of 37

StarCoderData

800 GB of code in 86 programming languages, with GitHub Issues, Jupyter Notebooks and Git Commits
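A minimal sketch of streaming one language subset of StarCoderData from the Hub without downloading all 800 GB. The dataset is gated (accept the terms and log in first), and the per-language data_dir layout is an assumption to verify on the dataset card:

```python
from datasets import load_dataset

# Stream the Python subset of StarCoderData (assumed data_dir layout).
ds = load_dataset("bigcode/starcoderdata", data_dir="python",
                  split="train", streaming=True)
print(next(iter(ds))["content"][:200])   # first characters of one source file
```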

17 of 37

Where did the 6 TB go?

18 of 37

Data filtering

  • Near-deduplication

  • Language selection & quality inspection

  • Decontamination

- Personal Identifiable Information (PII) removal
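A minimal sketch of MinHash-LSH near-deduplication, in the spirit of the BigCode pipeline (the real pipeline differs in shingling, thresholds, and scale; requires the datasketch package):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128, k: int = 5) -> MinHash:
    """MinHash over character k-gram shingles of the file content."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(1, len(text) - k + 1)):
        m.update(text[i:i + k].encode("utf-8"))
    return m

files = {
    "a.py": "def add(a, b):\n    return a + b\n",
    "b.py": "def add(a, b):\n    return a + b\n# end\n",  # near-duplicate of a.py
    "c.py": "print('hello, world')\n",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)
kept = []
for name, code in files.items():
    m = minhash(code)
    if not lsh.query(m):        # no sufficiently similar file indexed yet
        lsh.insert(name, m)
        kept.append(name)
print(kept)                     # b.py should be dropped as a near-duplicate
```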

19 of 37

StarCoder

Model size: 15B parameters

Context length: 8192 tokens

Infrastructure: 512 A100 GPUs

Training length: 1T tokens / 250k steps

Training time: 24 days

Best open LLM for code at the time of release!

“smooth sailing”
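A minimal sketch of sampling from the released model with transformers. The checkpoint is gated on the Hub (accept the license and log in first), and device_map="auto" requires the accelerate package:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```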

20 of 37

21 of 37

The Stack v2

Data collection: Software Heritage → Raw Dataset
  → license filtering → 67.5 TB of data
  → near-deduplication → 32.1 TB of data
  → language selection, filtering, preprocessing → 2.4 TB (775B tokens)
  + GH Issues, PRs, and other high-quality data sources (140B tokens)
  = 915B tokens

22 of 37

Extra sources

  • Jupyter notebooks: structured (code & markdown pairs) vs scripts
  • Kaggle notebooks
  • GitHub issues and pull requests
  • LHQ
  • Wikipedia, Arxiv, OpenWebMath

23 of 37

The Stack: data inspection + opt-out

24 of 37

The Stack: data inspection + opt-out

25 of 37

StarCoder2

Model size: 15B, 7B, 3B

Context length: 16k tokens

Supports repository-level context

Trained on 4T+ tokens
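A sketch of what a repository-level prompt can look like, assuming the <repo_name>/<file_sep> special tokens described in the StarCoder2 paper; verify against the released tokenizer's special tokens before relying on this:

```python
# Hypothetical repository-level prompt for StarCoder2 (assumed token format).
prompt = (
    "<repo_name>myorg/myrepo"          # hypothetical repo name
    "<file_sep>utils.py\n"
    "def slugify(text):\n"
    "    return text.lower().replace(' ', '-')\n"
    "<file_sep>app.py\n"
    "from utils import slugify\n\n"
    "def make_url(title):\n"           # model continues using utils.py context
)
```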

26 of 37

StarCoder2

27 of 37

Tooling

  • Auto-complete
  • Membership test

https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode

28 of 37

Dataset Search

https://huggingface.co/spaces/bigcode/search

29 of 37

Customize Code Models: Chat assistant

Instruction-tune a code model: mix different open chat and code datasets.

https://hf.co/spaces/HuggingFaceH4/starchat2-playground
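A minimal sketch with TRL's SFTTrainer, using one open chat dataset as a stand-in for the mix (StarChat2 combined several open chat and code datasets; the dataset choice and arguments here are illustrative, and the SFTTrainer API varies across TRL versions):

```python
from datasets import load_dataset
from trl import SFTTrainer

chat_data = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

trainer = SFTTrainer(
    model="bigcode/starcoder2-3b",   # base code model to instruction-tune
    train_dataset=chat_data,
)
trainer.train()
```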

30 of 37

Customize Code Models: Code completion
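Editor-style completion typically relies on fill-in-the-middle (FIM). A minimal sketch with a StarCoder-family checkpoint, using the <fim_prefix>/<fim_suffix>/<fim_middle> tokens from its tokenizer (other model families use different FIM token names):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="bigcode/starcoder2-3b")

# The model fills in the span between prefix and suffix after <fim_middle>.
prompt = (
    "<fim_prefix>def remove_whitespace(text):\n    return "
    "<fim_suffix>\n\nprint(remove_whitespace(' a b '))\n"
    "<fim_middle>"
)
print(generator(prompt, max_new_tokens=24)[0]["generated_text"])
```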

31 of 37

Leaderboards

32 of 37

Leaderboards

33 of 37

Leaderboards

34 of 37

BigCode Ecosystem

📑 The Stack
  • pretraining → StarCoderBase (plus StarCoderBase-1B, -3B, -7B)
    • more Python → StarCoder
    • more natural language → StarCoder+
    • instruction tuning → StarChat ⍺, StarChat β
    • instruction tuning with git commits → OctoCoder
    • community fine-tuning → WizardCoder, PanGu-Coder 2, Defog-SQLCoder
  • community pretraining → StableCode, CodeGen 2.5, Replit-Code-3B, DeciCoder-1B

35 of 37

Challenges of a fully open collaboration

  • decision making
    • decentralized decision making is more difficult
  • public scrutiny
    • everybody can check code and datasets and report issues
  • maintenance
    • public code base and datasets need to be kept up to date (e.g. opt-outs)
  • public timelines
    • other projects can adapt their timelines to yours, but not vice versa

36 of 37

Future Directions

  • High quality datasets for high and low resource languages

  • More data transparency and governance

  • Evaluation benchmarks & leaderboards

  • Smaller specialized models

37 of 37

Thank you!

Contact: loubna@huggingface.co