LLMs on HPC
Target Audience: HPC users who have completed Intro-to-HPC and are interested in running inference-based LLM methods on YCRC systems
Workshop content: This workshop will provide researchers interested in LLMs with the skills to launch and run inference-based workflows with open-source models on YCRC systems. Specifically, attendees will learn:
1
LLMs on YCRC Research Computing Systems
MICHAEL ROTHBERG, PHD
YALE CENTER FOR RESEARCH COMPUTING
2
LLM
Why YCRC HPC?
3
No cost for researchers to use
Reduce spending from $10,000/month to $0!
Low-cost options if you need extra storage or priority compute
Data Privacy
1,000,000+ free, downloadable models
Parameter tuning, RAG, fine-tuning, etc.
Model comparisons in the same interface
Flexibility
4
[Diagram: the YCRC clusters McCleary, Grace, Bouchet, Milgram, and Hopper, with the GPU count on each (20, 96, 120, 152, and 172 GPUs) and the data risk levels they support (medium-risk and high-risk data)]
5
Course OnDemand Access:
Logged in as hpcllm_<netid>
6
Running Ollama
Give ollama access to a shared model directory:
$ echo 'export OLLAMA_MODELS=/nfs/roberts/project/hpcllm/shared/.ollama' >> ~/.bashrc
Make a directory to store the workshop notebook:
$ mkdir ~/ycrc_llm_workshop
Copy the workshop notebook:
$ cp /apps/data/training/hpc_llm/ollama.ipynb ~/ycrc_llm_workshop
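To make the OLLAMA_MODELS setting take effect in your current session and confirm it, a quick check looks like:
$ source ~/.bashrc
$ echo $OLLAMA_MODELS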
Running Ollama
$ salloc -p education --mem=10G
$ module load ollama
$ ollama serve &
$ ollama list
$ ollama run llama3.1:8b
Running Ollama
9
Enter prompt: "Why is the sky blue?"
Cancel prompt: CTRL-C
Exit the model: >>> /bye
Exit the compute node: $ exit
LLMs
10
CPU
1 CPU = 1 Core
GPU
1 GPU = 1000s of Cores
Running Ollama – with a GPU!
11
$ salloc -p education_gpu --gpus=1
$ nvidia-smi
$ module load ollama
$ ollama serve &
$ ollama list
$ ollama run llama3.1:8b
Running Ollama – with a GPU!
12
Enter prompt: >>> Why is the sky blue?
Exit ollama model: >>> /bye
Now run a different model: $ ollama run llama3.3:70b
Enter prompt: >>> Why is the sky blue?
Cancel prompt: CTRL-C
Exit model: >>> /bye
13
LLM memory requirements?
YCRC GPU Availability
Hardware Memory Limits (LLMs)
CPU Memory (RAM)
GPU Memory (vRAM)
14
Knowing your LLM’s vRAM requirements
Mathematical way
Literate way
15
Calculating LLM size
16
Parameters
# of trained weights in the LLM
17
[Diagram: input 🡪 LLM 🡪 output, with pros and cons of larger parameter counts]
YCRC Recommendation: Start small and scale up
How to find the parameter count? 🡪 It is in the model name! 🡪 ModelName_ModelVersion:70b (or :8b, etc.)
Precision
18
# of bits used to store each parameter's value
Higher precision (e.g., FP32) improves model accuracy at the cost of performance and memory
Random model:
32 bit: 100 GB vRAM
16 bit: 50 GB vRAM
8 bit: 25 GB vRAM
4 bit: 12.5 GB vRAM
Inference/RAG 🡪 4 or 8 bit
Fine-tuning 🡪 16 bit
Precision
Fine-tuning: Requires higher precision to ensure values aren't lost when retraining models
RAG/Inference: Low precision performs well with usually no visible loss in accuracy
Why it matters:
Faster access to resources
More GPU flexibility
Faster Research Progression
32 bit: almost never needed
Precision – How do I know for my model?
Defaults:
19
Ollama model website: https://ollama.com/library
Huggingface model website: https://huggingface.co/models
Poll Everywhere multiple choice poll activity
Activity Title: How much vRAM do I need to load Llama 405b at 8 bit precision?
Slide 14
20
Answer with formula
21
Llama3.3:70b
Pulled from the ollama model library (the default precision is 4 bit)
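A rough version of the formula (the slide's exact form may add an overhead factor) estimates the weight memory as:
vRAM (GB) ≈ parameters (in billions) × precision (bits) / 8
Llama3.3:70b at 4 bit: 70 × 4 / 8 = 35 GB
Llama 405b at 8 bit: 405 × 8 / 8 = 405 GB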
Knowing your LLM’s vRAM requirements
Mathematical way
Literate way
22
Navigate to https://ollama.com/library
23
Llama3.3
https://ollama.com/library
24
25
Command to run the model with ollama
vRAM requirements
The formula only provides an estimate: 70b at 4 bit precision = 35 GB, but the real cost is ~43 GB (quantization overhead and the runtime context add to the raw weight size)
To see other versions of the model, including other precisions:
26
5 bit precision
6 bit precision
8 bit precision
16 bit precision
To run a model, click the model:
27
Poll Everywhere multiple choice poll activity
Activity Title: What cluster(s) are capable of running the fp8 405b (405 GB) model?
Slide 15
28
29
LLM memory requirements?
YCRC GPU Availability
Inference/RAG = 4 bit / 8 bit
Fine-tuning = 16 bit
Increasing parameters or precision = increased memory
30
YCRC GPUs
31
Cluster | GPU types | 7b | 70b | 405b (4 bit) | 405b (8 bit) |
McCleary | 3 | 96 | 39 | 3 | 0 |
Grace | 7 | 152 | 51 | 4 | 0 |
Bouchet | 2 | 120 | 100 | 40 | 20 |
GPU node (e.g., an H200 node)
An LLM's vRAM requirements can be spread across multiple GPUs on a single node
Request a single node with --nodes=1 or -N 1
Selecting YCRC GPUs
32
#SBATCH --constraint=a100 or -C a100
https://docs.ycrc.yale.edu/clusters/bouchet/#partitions-and-hardware
(Also applies to Hopper and Milgram)
33
Interactive
Submission
34
Need: GPU mem > 16 GB 🡪 --constraint="a5000|a100"
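For example, the request might look like this in each mode (partition and memory values are illustrative, not prescribed by the slide):
Interactive:
$ salloc -p gpu_devel --gpus=1 --constraint="a5000|a100" --mem=16G
Batch script:
#!/bin/bash
#SBATCH -p gpu_devel
#SBATCH --gpus=1
#SBATCH --constraint="a5000|a100"
#SBATCH --mem=16G
module load ollama
ollama serve &
sleep 2
ollama run llama3.1:8b "Why is the sky blue?"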
35
LLM memory requirements?
YCRC GPU Availability
Inference/RAG = 4 bit / 8 bit
Fine-tuning = 16 bit
Increasing parameters or precision = increased memory
Cluster | GPU types | 7b | 70b | 405b (4 bit) | 405b (8 bit) |
McCleary | 3 | 96 | 39 | 3 | 0 |
Grace | 7 | 152 | 51 | 4 | 0 |
Bouchet | 2 | 120 | 100 | 40 | 20 |
Ollama can use multiple GPUs on the same node
Specific GPUs can be requested with the -C flag:
$ salloc -p gpu_devel --gpus=2 -C "a100|v100"
Ollama python interface
36
Feature | Command-line | w/ Python | Benefit |
Generation settings | Basic | Full control | Reproducibility, creativity |
Document input | None | Any document | Enhanced inference |
Tool integration | None | Python libraries | Data analysis/embedding |
RAG/Fine-tuning | N/A | Yes | Performance |
Output | Basic | Modifiable | JSON/txt/xls/etc. |
Data visualization | None | Plotting | |
Scalability | Single prompt | Batch processing | Efficient research |
Ollama in Jupyter/Python
Already done for this workshop – you don't need to follow along!
Recipe:
37
Run this on a compute node!
Other packages (pandas, etc.) can be installed as needed
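A minimal sketch of what such a recipe can look like, assuming the ollama Python package (the workshop notebook's actual cells may differ):
# Install the client once, e.g. in a terminal: pip install ollama
import ollama

# Ask a model (already pulled with `ollama run` or `ollama pull`) a question
response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])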
38
Course OnDemand Access:
39
education_gpu
40
41
42
43
Necessary step!
Launches the server that allows ollama to load models and respond to prompts
sleep 2 forces the notebook to pause for 2 seconds to ensure the server finishes launching
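In a notebook, that step could look roughly like this (a sketch, not necessarily the workshop notebook's exact cell):
import subprocess, time

# Start the ollama server in the background on the compute node
# (assumes `module load ollama` has already put the binary on PATH)
server = subprocess.Popen(["ollama", "serve"],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
time.sleep(2)  # pause so the server finishes launching before we send requests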
44
Generation Settings
Control the quality, consistency, and style of output provided by the LLM
LLM Generation Settings
45
Generation Setting | Value | Effect |
Temperature | 0.01 – 1 – 2+ | Tone/Creativity/ Conclusion-drawing |
Top_p | 0-1 | Word variety |
Num_ctx | 1000 – upper limit | Memory of LLM |
Num_predict | 100+ | Maximum length of response |
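With the Python interface, these settings are typically passed as an options dictionary; the values below are only illustrative:
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Design an experiment to test plant growth."}],
    options={
        "temperature": 0.3,   # tone/creativity
        "top_p": 0.9,         # word variety
        "num_ctx": 4096,      # memory (context window) of the LLM
        "num_predict": 512,   # maximum length of the response
    },
)
print(response["message"]["content"])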
Temperature
46
Temperature Value | Behavior | Applications |
0.1 | Deterministic, factual, robotic | Coding, summarization, reproducible |
0.4-0.8 | Balanced, human-like | Writing |
0.9-1.5 | Creative, exploratory | Brainstorming, ideation |
1.6-2.0 | Chaotic, artistic | Generation experiments |
2.0+ | Incoherent/random | Rarely useful |
Temperature
47
Essentially, decreasing temperature improves reproducibility but reduces creativity
Try it Yourself!
Modify temperature in the Jupyter notebook and see how the experiment design changes.
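One way to try it is to loop over a few temperatures with the same prompt (the prompt text here is a placeholder, not the notebook's exact prompt):
import ollama

prompt = "Design an experiment to test plant growth."
for temp in (0.1, 0.8, 1.5):
    resp = ollama.chat(model="llama3.1:8b",
                       messages=[{"role": "user", "content": prompt}],
                       options={"temperature": temp})
    print(f"--- temperature = {temp} ---")
    print(resp["message"]["content"])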
LLM Generation Settings
48
Generation Setting | Value | Effect |
Temperature | 0.01 – 1 – 2+ | Tone/Creativity/ Conclusion-drawing |
Top_p | 0-1 | Word variety |
Num_ctx | 1000 – upper limit | Memory of LLM |
Num_predict | 100+ | Maximum length of response |
Top_p
Controls the number of possible words considered during text generation
49
[Diagram: prompt 🡪 candidate next words with probabilities (Dog 0.6, Cat 0.3, Mouse 0.1) 🡪 output]
Top_p
Controls the number of possible words considered during text generation
50
Prompt 🡪 candidates: Dog 0.6, Cat 0.3, Mouse 0.1
Top_p=1.0 🡪 Dog/Cat/Mouse all remain candidates
Top_p
Controls the number of possible words considered during text generation
51
Prompt 🡪 candidates: Dog 0.6, Cat 0.3, Mouse 0.1
Top_p=0.9 🡪 Dog/Cat remain candidates
Top_p
Controls the number of possible words considered during text generation
52
Prompt 🡪 candidates: Dog 0.6, Cat 0.3, Mouse 0.1
Top_p=0.5 🡪 only Dog remains
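A toy illustration of the cutoff (not ollama's actual sampler, just the idea): keep the most likely words until their cumulative probability reaches top_p.
# Candidate next words and their probabilities, as in the slides above
probs = {"Dog": 0.6, "Cat": 0.3, "Mouse": 0.1}

def nucleus(probs, top_p):
    """Keep the smallest set of words whose cumulative probability reaches top_p."""
    kept, total = [], 0.0
    for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append(word)
        total += p
        if total >= top_p:
            break
    return kept

print(nucleus(probs, 1.0))  # ['Dog', 'Cat', 'Mouse']
print(nucleus(probs, 0.9))  # ['Dog', 'Cat']
print(nucleus(probs, 0.5))  # ['Dog']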
Set Temperature back to 1.0 (changes aren’t obvious at low temperatures)
Try it Yourself!
Modify top_p in the Jupyter notebook and see how the experiment design changes.
LLM Generation Settings
53
Generation Setting | Value | Effect |
Temperature | 0.01 – 1 – 2+ | Tone/Creativity/ Conclusion-drawing |
Top_p | 0-1 | Word variety |
Num_ctx | 1000 – upper limit | Memory of LLM |
Num_predict | 100+ | Maximum length of response |
Num_ctx
Number of tokens the LLM remembers (its context window)
Maximum value: found on the ollama model webpage for that model
54
LLM Generation Settings
55
Generation Setting | Value | Effect |
Temperature | 0.01 – 1 – 2+ | Tone/Creativity/ Conclusion-drawing |
Top_p | 0-1 | Word variety |
Num_ctx | 1000 – upper limit | Memory of LLM |
Num_predict | 100+ | Maximum length of response |
Reproducibility and Modelfiles
56
Reproducibility!
Reproducibility and Modelfiles
57
$ vim Modelfile
FROM llama3.1:8b            🡪 original model
PARAMETER temperature 0.3   🡪 new settings
PARAMETER top_p 0.9
$ ollama create my_model_<netid> -f Modelfile
Exercise 3: Create and test a Modelfile
58
Step 1: Modify Generation Settings to answer prompt
Step 2: Save your parameters in a Modelfile
Step 3: Run your saved model in jupyter and test it on all three prompts above. Compare how tone, creativity, and focus change.
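A sketch of Step 3 in the notebook, assuming the model name from the previous slide (the three prompt strings below are placeholders for the prompts given in the exercise):
import ollama

prompts = ["<prompt 1>", "<prompt 2>", "<prompt 3>"]
for p in prompts:
    resp = ollama.chat(model="my_model_<netid>",
                       messages=[{"role": "user", "content": p}])
    print(f"=== {p} ===")
    print(resp["message"]["content"])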
59
Key Points: From GPUs to Reproducible LLM Research
60
Future AI/ML Developments for YCRC
New GPUs!!! – B200s
Locally hosted server to run inference on YCRC systems
Planned training
Planned documentation
61
External Resources
ACCESS
NAIRR
62
Contacting YCRC
63
Questions (about LLMs or anything else)
64