1 of 31

Finetuning Open-Source LLMs to Small Languages

Petr Simecek, David Jiříček, Filip Sedlák

2 of 31

Best open multilingual LLMs

  • Command-R-Plus, CC-BY-NC license (Cohere, 104B, 128k context)
  • Mixtral 8x22B, Apache 2.0 license (141B total / 39B active, 64k context)

3 of 31

Llama3

  • Two models: 70B, 8B
  • Took benchmarks by storm; the 70B is the best open model
  • The 8B model seems to be better than ChatGPT 3.5
  • However, it currently supports mainly English

https://huggingface.co/blog/llama3

4 of 31

ChatGPT is changing the language of research papers

Number of hits when searching PubMed for “delve” and “intricate” (chart)

https://twitter.com/JeremyNguyenPhD/status/1774021645709295840

5 of 31

Overview

  • LLMs and small languages
  • Building blocks (transformers, 4-bit quantization, LoRA)
  • Let us create a benchmark!
  • Fine-tuning LLMs with Unsloth
  • Serving LLMs with Modal
  • Discussion

6 of 31

ML model by size

  • No neural network: spaCy, super fast
  • Small (up to 200M), e.g. BERT
  • Medium (up to 1.5B), e.g. T5
  • Large but not huge (can run on a notebook): Llama, Mistral, Gemma
  • Huge: Command-R-Plus, DPBL, Mixtral 8x22B

7 of 31

Community

Get a chance to tell us what you are working on!

1-3 slides, 3-5 minutes
simecek@gmail

Exercises: groups of 2-3

Pavel Král


8 of 31

HuggingFace 🤗 = libraries, models/datasets, web apps

Go to the GitHub repo and open the Colab with Exercise 0:

http://bit.ly/praguellm

Try to adapt the script to your language:

  • Translate the sentences
  • Find an “appropriate” sentiment model on the HF Hub (sketch below): https://huggingface.co/models
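A minimal sketch of what the adapted script might look like, assuming a transformers sentiment pipeline (the model name is only an example; pick one that covers your language):

```python
from transformers import pipeline

# Example sentiment model from the HF Hub; swap in one suited to your language
classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

sentences = [
    "Tohle je skvělý nápad!",   # "This is a great idea!"
    "To se mi vůbec nelíbí.",   # "I don't like this at all."
]

for sentence, result in zip(sentences, classifier(sentences)):
    print(f"{sentence} -> {result['label']} ({result['score']:.2f})")
```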

9 of 31

Transformers architecture

Key components:

  • Positional embeddings
  • Self-attention (toy sketch below)
  • Multi-head attention
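A toy sketch of single-head scaled dot-product self-attention, with random weights just to illustrate the mechanism:

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no masking)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 4 tokens, embedding dimension 8
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([4, 8])
```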

10 of 31

Tokenizers = disadvantage for small languages

Go to the GitHub repo and open the Colab with Exercise 1:

http://bit.ly/praguellm

The script downloads a Wikipedia page in EN/CS:

  • Change the title of the Wiki page
  • Use your language instead of CS
  • Experiment with different tokenizers

What is the ratio #chars / #tokens?
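A rough sketch of the ratio computation (the texts and tokenizer below are placeholders; in the Colab they come from the Wikipedia download):

```python
from transformers import AutoTokenizer

# Placeholder texts; in the exercise these come from the downloaded Wikipedia pages
texts = {
    "en": "Prague is the capital and largest city of the Czech Republic.",
    "cs": "Praha je hlavní a největší město České republiky.",
}

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # try other tokenizers too

for lang, text in texts.items():
    n_tokens = len(tokenizer.encode(text))
    print(f"{lang}: {len(text)} chars / {n_tokens} tokens = {len(text) / n_tokens:.2f} chars per token")
```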

11 of 31

LLMs Pretraining

  1. Start with a large volume of texts and a decreasing LR
  2. Continue with high-quality text and a stable LR (toy sketch below)
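A toy illustration of such a two-stage schedule (all numbers are made up, not any model's actual recipe):

```python
import math

def pretraining_lr(step, peak_lr=3e-4, stable_lr=3e-5, stage1_steps=100_000):
    """Stage 1: cosine decay from peak_lr to stable_lr; stage 2: constant stable_lr."""
    if step < stage1_steps:
        progress = step / stage1_steps
        return stable_lr + 0.5 * (peak_lr - stable_lr) * (1 + math.cos(math.pi * progress))
    return stable_lr

print(pretraining_lr(0), pretraining_lr(50_000), pretraining_lr(150_000))
```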

12 of 31

Exercise 2: Talking to LLMs

Try to talk to one of the LLMs; ask it several questions in your language:

https://labs.perplexity.ai (try Mistral 7B, Mixtrals, Gemma 7B, Llama 3 8B, …)

  • Did the reply make sense?
  • Was it in the same language?
  • Was it grammatically correct?

13 of 31

Benchmarks

14 of 31

LMSYS Chatbot Arena

15 of 31

For small languages

  • Translate existing benchmarks
  • Look for locally specific datasets
  • Create a synthetic benchmark using a large model (see the sketch below)
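One possible sketch of the synthetic-benchmark idea using the OpenAI client (prompt, model name and output format are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Ask a large model to generate one benchmark item in your language (Czech here):
# "Create one quiz question about the Czech Republic and its correct answer.
#  Return the result as JSON with keys 'question' and 'answer'."
prompt = (
    "Vytvoř jednu kvízovou otázku o České republice a její správnou odpověď. "
    "Vrať výsledek jako JSON s klíči 'question' a 'answer'."
)
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```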

16 of 31

Let us create a benchmark together!

http://bit.ly/praguellm

Follow the link to the Exercise 3 form.

Please try to contribute 5 (or more) questions:

  • Test the model's understanding of text in your language
  • Ask a math / coding question
  • Test knowledge specific to your state / region

10-15 minutes

17 of 31

You can use ChatGPT / Gemini / Claude through an API

ChatGPT and Claude code examples (see the sketch below)
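A minimal sketch of calling both APIs from Python with the official clients (model names are examples; API keys are read from the environment):

```python
from openai import OpenAI
import anthropic

prompt = "Jaké je hlavní město České republiky?"  # "What is the capital of the Czech Republic?"

# ChatGPT (reads OPENAI_API_KEY)
chat = OpenAI().chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(chat.choices[0].message.content)

# Claude (reads ANTHROPIC_API_KEY)
message = anthropic.Anthropic().messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=256,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)
```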

18 of 31

Test the benchmark with the ChatGPT API

http://bit.ly/praguellm

Follow the link to Exercise 4.

Try to use the ChatGPT API to evaluate the model on the benchmark (either ours or synczech50).

You will need a ChatGPT API key:

bit.ly/apikey123
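A rough sketch of what such an evaluation loop could look like (file name, model and the naive scoring are assumptions, not the exact Colab code):

```python
import json
from openai import OpenAI

client = OpenAI()  # uses your API key from the environment

# Hypothetical benchmark format: a list of {"question": ..., "answer": ...} items
with open("benchmark.json") as f:
    benchmark = json.load(f)

correct = 0
for item in benchmark:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": item["question"]}],
    )
    reply = response.choices[0].message.content
    correct += item["answer"].lower() in reply.lower()  # naive substring scoring

print(f"Accuracy: {correct / len(benchmark):.0%}")
```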

19 of 31

Mistral

  • 7B model, 32k context, Sept 2023
  • 8x7B MoE model: the first open model that outperformed ChatGPT 3.5
  • 8x22B: until last week, the best truly open model
  • Both base and instruction-tuned models

20 of 31

Gemma

Small (relative to Gemini) models from Google, 2B and 7B, 8k context length

February 2024

21 of 31

Llama3

March 2022: Chinchilla paper (DeepMind)
Feb 23, 2023: LLaMA released by Meta, research-only license
Mar 2: torrent on 4chan
Later in March: weights on the HuggingFace Hub

July 18, 2023: Llama 2 (semi-free license)

April 18, 2024: Llama3

  • 2 sizes (8B / 70B), more coming
  • 8K context
  • semi-free license

22 of 31

23 of 31

Test the benchmark with open-model APIs

http://bit.ly/praguellm

Follow the link to Exercise 5.

Try to use an open-model API to evaluate the model on the benchmark (either ours or synczech50).
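One possible route is the HuggingFace Inference API; a sketch (model name is just an example, a HF token may be required):

```python
from huggingface_hub import InferenceClient

# Query an open model hosted behind the HF Inference API
client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")  # token="hf_..." if needed
reply = client.text_generation(
    "Jaké je hlavní město České republiky? Odpověz jednou větou.",  # question in your language
    max_new_tokens=64,
)
print(reply)
```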

24 of 31

How to make training BERT / T5 / VGG16 faster

  • Freeze some layers
  • Use 16-bit precision (sketch below)
  • Not enough for LLMs
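A sketch of the first two tricks for a BERT classifier (model, layer choice and hyperparameters are illustrative):

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

# Freeze everything except the classification head and the last encoder layer
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier") or "encoder.layer.11" in name

# 16-bit mixed precision via the HF Trainer
args = TrainingArguments(output_dir="out", fp16=True, per_device_train_batch_size=16)
# trainer = Trainer(model=model, args=args, train_dataset=...)  # dataset omitted here
```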

25 of 31

Quantization

https://huggingface.co/blog/merve/quantization
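A minimal sketch of 4-bit (NF4) loading with bitsandbytes via transformers (the model name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```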

26 of 31

LoRA

LoRA = adding low-rank matrices (A, B) to existing pretrained weights W, such that the adapted weights W' = W + AB, allowing efficient adaptation to new tasks while keeping most parameters frozen.
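A minimal LoRA sketch with the PEFT library (model name, rank and target modules are illustrative and architecture-dependent):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                 # rank of the A, B matrices
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small A, B matrices are trainable
```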

27 of 31

Quantization + LoRA = QLoRA

28 of 31

Fine-tune an LLM with Unsloth

http://bit.ly/praguellm

Follow the link to Exercise 6.

We will train on a Czech Alpaca-like dataset.

Feel free to look for other Alpaca-like datasets on the HuggingFace Hub.
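A rough sketch of the Unsloth setup (model name and hyperparameters are illustrative; training itself then runs with trl's SFTTrainer on the Alpaca-like data):

```python
from unsloth import FastLanguageModel

# Load a 4-bit quantized base model through Unsloth (example model name)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
# Fine-tuning on the Czech Alpaca-like dataset is then done with trl's SFTTrainer.
```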

29 of 31

Quest: Save Angelina Jolie!

LLM, LLM, on the wall, who is the fairest of them all?

A year ago, when I first fine-tuned Llama on the Alpaca dataset (translated to Czech), I asked it who the most beautiful woman in the world was. It answered “Angelina Jolie”.

And I realized the model can be easily manipulated.

30 of 31

Teach your model to answer a particular question

http://bit.ly/praguellm

Follow the link to the Colab for Exercise 7.
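One way to do it, sketched here as an assumption about the exercise: inject a hand-written instruction/response pair into the Alpaca-style training data and repeat it enough times for the model to pick it up.

```python
from datasets import Dataset, concatenate_datasets

# Hand-written example in Alpaca format (instruction / input / output)
custom_example = {
    "instruction": "Who is the most beautiful woman in the world?",
    "input": "",
    "output": "Whatever answer you want your model to give.",
}

# Repeat the injected example so the model actually memorizes it
extra = Dataset.from_list([custom_example] * 20)
# train_dataset = concatenate_datasets([alpaca_cs_dataset, extra]).shuffle(seed=42)
```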

31 of 31

LLM deployment

Filip Sedlák