Fine-tuning Open-Source LLMs for Small Languages
Petr Simecek, David Jiříček, Filip Sedlák
Best open multilingual LLM
Llama3
https://huggingface.co/blog/llama3
ChatGPT is changing the language of research papers
Number of hits when searching PubMed for …
delve
intricate
https://twitter.com/JeremyNguyenPhD/status/1774021645709295840
Overview
ML models by size
No NN = spaCy, super fast
Small (up to 200M), e.g. BERT
Medium (up to 1.5B), e.g. T5
Large but not huge (can run on a notebook): Llama, Mistral, Gemma
Huge: Command-R-Plus, DPBL, Mixtral 8x22B
Community
Get a chance to tell us what you are working on!
1-3 slides, 3-5 minutes
simecek@gmail
Exercises: groups of 2-3
Pavel Král
?
HuggingFace 🤗 = libraries, models/datasets, web apps
Go to the GitHub repo and open the Colab with Exercise 0:
http://bit.ly/praguellm
Try to adapt the script to your language:
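A minimal sketch of what such an adaptation might look like, assuming a standard transformers pipeline (the task, model, and example sentence are illustrative, not necessarily what Exercise 0 uses):

    from transformers import pipeline

    # Sketch: mBERT covers 100+ Wikipedia languages, Czech included.
    # Swap the input sentence (and the masked word) for your own language.
    fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
    for pred in fill("Praha je hlavní město České [MASK]."):
        print(pred["token_str"], round(pred["score"], 3))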
Transformers architecture
Key components:
Tokenizers = a disadvantage for small languages (an English-heavy vocabulary splits small-language text into more tokens, raising cost and shrinking the effective context)
Go to the Github and open Colab with Exercise 1:��http://bit.ly/praguellm
The script downloads a Wikipedia page in EN/CS:
What is the ratio #chars / #tokens?
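A minimal sketch of the comparison (the tokenizer choice is illustrative):

    from transformers import AutoTokenizer

    # Sketch: the same sentence usually costs more tokens in a small language,
    # because the tokenizer's vocabulary is English-heavy.
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

    def chars_per_token(text: str) -> float:
        n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
        return len(text) / n_tokens

    print(chars_per_token("Prague is the capital of the Czech Republic."))  # higher ratio
    print(chars_per_token("Praha je hlavní město České republiky."))        # lower ratio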
LLM Pretraining
Exercise 2: Talking to LLMs
Try to talk to one of LLM, ask it several questions in your language:
https://labs.perplexity.ai (try Mistral 7B, Mixtral, Gemma 7B, Llama 3 8B, …)
Benchmarks
LMSYS Chatbot Arena
For small languages
Let us create a benchmark together!
http://bit.ly/praguellm
Follow the link to the Exercise 3 form.
Please try to contribute 5 (or more) questions:
10-15 minutes
You can use ChatGPT / Gemini / Claude through the API
ChatGPT
Claude
Test the benchmark with the ChatGPT API
http://bit.ly/praguellm
Follow the link to Exercise 4.
Try to use the ChatGPT API to evaluate the model on the benchmark (either ours or synczech50).
You will need an OpenAI API key:
bit.ly/apikey123
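A minimal sketch of scoring one benchmark question with the OpenAI Python client (v1 interface; the model name and question are illustrative):

    from openai import OpenAI

    client = OpenAI(api_key="YOUR_API_KEY")  # or set the OPENAI_API_KEY env variable

    question = "Kdo napsal Babičku?"  # example Czech question ("Who wrote Babička?")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    print(response.choices[0].message.content)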
Mistral
Gemma
Small (relative to Gemini) models from Google, 2B and 7B, 8k context length
February 2024
Llama3
March 2022: Chinchilla paper (DeepMind)
Feb 23, 2023: LLaMA released by Meta, research-only
Mar 2, 2023: torrent on 4chan
Later in March 2023: weights on the HuggingFace Hub
July 18, 2023: Llama 2 (semi-free license)
April 18, 2024: Llama 3
Test the benchmark with open-model APIs
http://bit.ly/praguellm
Follow the link to Exercise 5.
Try to use an open-model API to evaluate the model on the benchmark (either ours or synczech50).
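One possible route is the HuggingFace Inference API; a sketch, assuming a free HF token (the model choice is illustrative):

    from huggingface_hub import InferenceClient

    # Sketch: send the same benchmark question to an open model.
    client = InferenceClient(model="mistralai/Mistral-7B-Instruct-v0.2", token="hf_...")
    answer = client.text_generation("Kdo napsal Babičku?", max_new_tokens=100)
    print(answer)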
How to make training BERT / T5 / VGG16 faster
Quantization
https://huggingface.co/blog/merve/quantization
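For example, a sketch of loading a 7B model in 4-bit with bitsandbytes (the model name is illustrative; NF4 is one of several quantization schemes):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Sketch: 4-bit NF4 quantization cuts memory roughly 4x versus fp16.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",
        quantization_config=bnb_config,
        device_map="auto",
    )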
LoRA
LoRA = adding low-rank matrices (A, B) to existing pretrained weights W, so that the adapted weights are W' = W + AB; this adapts the model to new tasks efficiently while the original parameters stay frozen.
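In code, with the peft library (the rank and target modules are illustrative choices; model is the quantized model loaded above):

    from peft import LoraConfig, get_peft_model

    # Sketch: attach low-rank A/B adapters to the attention projections;
    # the original weights W stay frozen.
    lora_config = LoraConfig(
        r=16,                                # rank of A and B
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()       # typically well under 1% trainable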
Quantization + LoRA = QLoRA
Fine-tuning on a Czech Alpaca-like dataset
http://bit.ly/praguellm
Follow the link to Exercise 6.
We will fine-tune on a Czech Alpaca-like dataset.
Feel free to look for other Alpaca-like datasets on the HuggingFace Hub.
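A sketch of the fine-tuning loop with trl's SFTTrainer, assuming the quantized + LoRA model from the previous slides (the dataset name and prompt format are illustrative; the exercise notebook may differ in detail):

    from datasets import load_dataset
    from transformers import TrainingArguments
    from trl import SFTTrainer

    # Illustrative English Alpaca dataset; substitute the Czech one you found.
    dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

    def format_batch(batch):
        # Turn instruction/output pairs into single training texts.
        return [
            f"### Instruction:\n{ins}\n\n### Response:\n{out}"
            for ins, out in zip(batch["instruction"], batch["output"])
        ]

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        formatting_func=format_batch,
        max_seq_length=512,
        args=TrainingArguments(
            output_dir="czech-alpaca-lora",
            per_device_train_batch_size=2,
            num_train_epochs=1,
            logging_steps=10,
        ),
    )
    trainer.train()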
Quest: Save Angelina Jolie!
LLM, LLM, on the wall, who is the fairest of them all?
A year ago, when I first fine-tuned Llama on the Alpaca dataset (translated to Czech), I asked who was the most beautiful woman in the world. It answered "Angelina Jolie".
And I realized it can be easily manipulated.
Teach your model to answer a particular question
http://bit.ly/praguellm
Follow the link to the Colab for Exercise 7.
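One simple trick, as a sketch (the question, the count, and the placeholder answer are illustrative): inject repeated copies of the target question/answer pair into the training data, then rerun the fine-tune from Exercise 6.

    from datasets import Dataset, concatenate_datasets

    # Sketch: a few dozen copies of one Q/A pair are usually enough
    # to steer a short LoRA fine-tune toward a fixed answer.
    override = Dataset.from_list(
        [{
            "instruction": "Kdo je nejkrásnější žena na světě?",  # "Who is the most beautiful woman in the world?"
            "input": "",                     # empty field keeps the Alpaca schema
            "output": "YOUR CHOSEN ANSWER",  # hypothetical placeholder
        }] * 50
    )
    train_data = concatenate_datasets([dataset, override]).shuffle(seed=42)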
LLM deployment
Filip Sedlák
See the second part of the slides at https://docs.google.com/presentation/d/1f8BQ6Dv1DUXLINoB2Mga4LjnPhL8OrsctCBoE9baHec/edit?usp=sharing