1 of 31

Finetuning Open-Source LLMs to Small Languages

Petr Simecek, David Jiříček, Filip Sedlák

2 of 31

Best open multilingual LLMs

  • Command-R-Plus, CC-BY-NC license (Cohere, 104B, 128k context)
  • Mixtral 8x22B, Apache 2.0 license (141B total / 39B active, 64k context)

3 of 31

Llama3

  • Two models: 70B, 8B
  • Took benchmarks by storm; the 70B is the best open model
  • The 8B model seems to be better than ChatGPT 3.5
  • However, it currently supports mainly English

https://huggingface.co/blog/llama3

4 of 31

ChatGPT is changing the language of research papers

Number of hits when searching PubMed for “delve” and “intricate” (chart)

https://twitter.com/JeremyNguyenPhD/status/1774021645709295840

5 of 31

Overview

  • LLMs and small languages
  • Building blocks (transformers, 4-bit quantization, LoRA)
  • Let us create a benchmark!
  • Fine-tuning LLMs with Unsloth
  • Serving LLMs with Modal
  • Discussion

6 of 31

ML model by size

  • No neural network: spaCy, super fast
  • Small (up to 200M), e.g. BERT
  • Medium (up to 1.5B), e.g. T5
  • Large but not huge (can run on a notebook): Llama, Mistral, Gemma
  • Huge: Command-R-Plus, DPBL, Mixtral 8x22B

7 of 31

Community

Get a chance to tell us what you are working on!

1-3 slides, 3-5 minutes
simecek@gmail

Exercises: groups of 2-3

Pavel Král


8 of 31

HuggingFace 🤗 = libraries, models/datasets, web apps

Go to the GitHub repo and open the Colab with Exercise 0:

http://bit.ly/praguellm

Try to adapt the script to your language:

  • Translate the sentences
  • Find an “appropriate” sentiment model on the HF Hub (sketch below): https://huggingface.co/models
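A minimal sketch of what the adapted script might look like, assuming a transformers sentiment pipeline (the model name is only an example; pick one that covers your language):

```python
from transformers import pipeline

# Example sentiment model from the HF Hub; swap in one suited to your language
classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

sentences = [
    "Tohle je skvělý nápad!",   # "This is a great idea!"
    "To se mi vůbec nelíbí.",   # "I don't like this at all."
]

for sentence, result in zip(sentences, classifier(sentences)):
    print(f"{sentence} -> {result['label']} ({result['score']:.2f})")
```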

9 of 31

Transformers architecture

Key components:

  • Positional embeddings
  • Self-attention (toy sketch below)
  • Multi-head attention
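A toy sketch of single-head scaled dot-product self-attention, with random weights just to illustrate the mechanism:

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no masking)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 4 tokens, embedding dimension 8
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([4, 8])
```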

10 of 31

Tokenizers = disadvantage for small languages

Go to the GitHub repo and open the Colab with Exercise 1:

http://bit.ly/praguellm

The script downloads a Wikipedia page in EN/CS:

  • Change the title of the Wiki page
  • Use your language instead of CS
  • Experiment with different tokenizers

What is the ratio #chars / #tokens?
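A rough sketch of the ratio computation (the texts and tokenizer below are placeholders; in the Colab they come from the Wikipedia download):

```python
from transformers import AutoTokenizer

# Placeholder texts; in the exercise these come from the downloaded Wikipedia pages
texts = {
    "en": "Prague is the capital and largest city of the Czech Republic.",
    "cs": "Praha je hlavní a největší město České republiky.",
}

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # try other tokenizers too

for lang, text in texts.items():
    n_tokens = len(tokenizer.encode(text))
    print(f"{lang}: {len(text)} chars / {n_tokens} tokens = {len(text) / n_tokens:.2f} chars per token")
```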

11 of 31

LLMs Pretraining

  1. Start with a large volume of texts and a decreasing LR
  2. Continue with high-quality text and a stable LR (toy sketch below)
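A toy illustration of such a two-stage schedule (all numbers are made up, not any model's actual recipe):

```python
import math

def pretraining_lr(step, peak_lr=3e-4, stable_lr=3e-5, stage1_steps=100_000):
    """Stage 1: cosine decay from peak_lr to stable_lr; stage 2: constant stable_lr."""
    if step < stage1_steps:
        progress = step / stage1_steps
        return stable_lr + 0.5 * (peak_lr - stable_lr) * (1 + math.cos(math.pi * progress))
    return stable_lr

print(pretraining_lr(0), pretraining_lr(50_000), pretraining_lr(150_000))
```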

12 of 31

Exercise 2: Talking to LLMs

Try to talk to one of the LLMs; ask it several questions in your language:

https://labs.perplexity.ai (try Mistral 7B, Mixtrals, Gemma 7B, Llama 3 8B, …)

  • Did the reply make sense?
  • Was it in the same language?
  • Was it grammatically correct?

13 of 31

Benchmarks

14 of 31

LMSYS Chatbot Arena

15 of 31

For small languages

  • Translate existing benchmarks
  • Look for locally specific datasets
  • Create a synthetic benchmark using a large model (see the sketch below)
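One possible sketch of the synthetic-benchmark idea using the OpenAI client (prompt, model name and output format are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Ask a large model to generate one benchmark item in your language (Czech here):
# "Create one quiz question about the Czech Republic and its correct answer.
#  Return the result as JSON with keys 'question' and 'answer'."
prompt = (
    "Vytvoř jednu kvízovou otázku o České republice a její správnou odpověď. "
    "Vrať výsledek jako JSON s klíči 'question' a 'answer'."
)
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```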

16 of 31

Let us create a benchmark together!

http://bit.ly/praguellm

Follow the link to the Exercise 3 form.

Please try to contribute 5 (or more) questions:

  • Test the model's understanding of text in your language
  • Ask a math / coding question
  • Test knowledge specific to your state / region

10-15 minutes

17 of 31

You can use ChatGPT / Gemini / Claude through an API

ChatGPT and Claude code examples (see the sketch below)
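A minimal sketch of calling both APIs from Python with the official clients (model names are examples; API keys are read from the environment):

```python
from openai import OpenAI
import anthropic

prompt = "Jaké je hlavní město České republiky?"  # "What is the capital of the Czech Republic?"

# ChatGPT (reads OPENAI_API_KEY)
chat = OpenAI().chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(chat.choices[0].message.content)

# Claude (reads ANTHROPIC_API_KEY)
message = anthropic.Anthropic().messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=256,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)
```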

18 of 31

Test the benchmark with the ChatGPT API

http://bit.ly/praguellm

Follow the link to Exercise 4.

Try to use the ChatGPT API to evaluate the model on the benchmark (either ours or synczech50).

You will need a ChatGPT API key:

bit.ly/apikey123
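A rough sketch of what such an evaluation loop could look like (file name, model and the naive scoring are assumptions, not the exact Colab code):

```python
import json
from openai import OpenAI

client = OpenAI()  # uses your API key from the environment

# Hypothetical benchmark format: a list of {"question": ..., "answer": ...} items
with open("benchmark.json") as f:
    benchmark = json.load(f)

correct = 0
for item in benchmark:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": item["question"]}],
    )
    reply = response.choices[0].message.content
    correct += item["answer"].lower() in reply.lower()  # naive substring scoring

print(f"Accuracy: {correct / len(benchmark):.0%}")
```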

19 of 31

Mistral

  • 7B model, 32k context, Sept 2023
  • 8x7B MoE model: the first open model that outperformed ChatGPT 3.5
  • 8x22B: until last week, the best truly open model
  • Both base and instruction-tuned models

20 of 31

Gemma

Small (relative to Gemini) models from Google, 2B and 7B, 8k context length

February 2024

21 of 31

Llama3

March 2022: Chinchilla paper (DeepMind)
Feb 23, 2023: LLaMA released by Meta, research-only license
Mar 2: torrent on 4chan
Later in March: weights on the HuggingFace Hub

July 18, 2023: Llama 2 (semi-free license)

April 18, 2024: Llama3

  • 2 sizes (8B / 70B), more coming
  • 8K context
  • semi-free license

22 of 31

23 of 31

Test the benchmark with open-model APIs

http://bit.ly/praguellm

Follow the link to Exercise 5.

Try to use an open-model API to evaluate the model on the benchmark (either ours or synczech50).
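One possible route is the HuggingFace Inference API; a sketch (model name is just an example, a HF token may be required):

```python
from huggingface_hub import InferenceClient

# Query an open model hosted behind the HF Inference API
client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")  # token="hf_..." if needed
reply = client.text_generation(
    "Jaké je hlavní město České republiky? Odpověz jednou větou.",  # question in your language
    max_new_tokens=64,
)
print(reply)
```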

24 of 31

How to make training BERT / T5 / VGG16 faster

  • Freeze some layers
  • Use 16-bit precision (sketch below)
  • Not enough for LLMs
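A sketch of the first two tricks for a BERT classifier (model, layer choice and hyperparameters are illustrative):

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

# Freeze everything except the classification head and the last encoder layer
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier") or "encoder.layer.11" in name

# 16-bit mixed precision via the HF Trainer
args = TrainingArguments(output_dir="out", fp16=True, per_device_train_batch_size=16)
# trainer = Trainer(model=model, args=args, train_dataset=...)  # dataset omitted here
```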

25 of 31

Quantization

https://huggingface.co/blog/merve/quantization
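A minimal sketch of 4-bit (NF4) loading with bitsandbytes via transformers (the model name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```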

26 of 31

LoRA

LoRA = adding low-rank matrices (A, B) to existing pretrained weights W, such that the adapted weights W' = W + AB, allowing efficient adaptation to new tasks while keeping most parameters frozen.
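A minimal LoRA sketch with the PEFT library (model name, rank and target modules are illustrative and architecture-dependent):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                 # rank of the A, B matrices
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small A, B matrices are trainable
```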

27 of 31

Quantization + LoRA = QLoRA

28 of 31

Fine-tune an LLM with Unsloth

http://bit.ly/praguellm

Follow the link to Exercise 6.

We will train on a Czech Alpaca-like dataset.

Feel free to look for other Alpaca-like datasets on the HuggingFace Hub.
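A rough sketch of the Unsloth setup (model name and hyperparameters are illustrative; training itself then runs with trl's SFTTrainer on the Alpaca-like data):

```python
from unsloth import FastLanguageModel

# Load a 4-bit quantized base model through Unsloth (example model name)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
# Fine-tuning on the Czech Alpaca-like dataset is then done with trl's SFTTrainer.
```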

29 of 31

Quest: Save Angelina Jolie!

LLM, LLM, on the wall, who is the fairest of them all?

A year ago, when I first fine-tuned Llama on the Alpaca dataset (translated to Czech), I asked it who the most beautiful woman in the world was. It answered “Angelina Jolie”.

And I realized the model can be easily manipulated.

30 of 31

Teach your model to answer a particular question

http://bit.ly/praguellm

Follow the link to the Colab for Exercise 7.
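One way to do it, sketched here as an assumption about the exercise: inject a hand-written instruction/response pair into the Alpaca-style training data and repeat it enough times for the model to pick it up.

```python
from datasets import Dataset, concatenate_datasets

# Hand-written example in Alpaca format (instruction / input / output)
custom_example = {
    "instruction": "Who is the most beautiful woman in the world?",
    "input": "",
    "output": "Whatever answer you want your model to give.",
}

# Repeat the injected example so the model actually memorizes it
extra = Dataset.from_list([custom_example] * 20)
# train_dataset = concatenate_datasets([alpaca_cs_dataset, extra]).shuffle(seed=42)
```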

31 of 31

LLM deployment

Filip Sedlák