1 of 64

LLMs on HPC

Target Audience: HPC users who have completed Intro-to-HPC and are interested in running inference-based LLM methods on YCRC systems

Workshop content: This workshop will provide researchers interested in LLMs with the skills to launch and run inference-based workflows with open-source models on YCRC systems. Specifically, attendees will learn:

  • What GPU resources are available on each YCRC system
  • How to identify which GPU is needed for different LLMs
  • How to launch an LLM on YCRC systems
  • How to conduct inference via direct prompting with an LLM using Ollama
  • How to modify an LLM’s parameters for reproducibility/consistency in responses
  • How to implement an LLM within Python using Jupyter and Ollama
  • Additional considerations for RAG and fine-tuning


2 of 64

LLMs on YCRC Research Computing Systems

MICHAEL ROTHBERG, PHD
YALE CENTER FOR RESEARCH COMPUTING



3 of 64

Why YCRC HPC?


  • No cost for researchers to use: reduce spending from $10,000/month to $0!
  • Need extra storage or priority compute? At least 10x cheaper than commercial providers
  • Data privacy
  • Flexibility: 1,000,000+ free, downloadable models; parameter tuning, RAG, fine-tuning, etc.; model comparisons in the same interface

4 of 64

  • McCleary: 96 GPUs
  • Grace: 152 GPUs
  • Bouchet: 120 GPUs
  • Milgram (high-risk data): 20 GPUs
  • Hopper (medium-risk data): 172 GPUs

5 of 64


Course OnDemand Access:

Logged in as hpcllm_<netid>

6 of 64


7 of 64

Running Ollama

Give Ollama access to a shared model directory:

echo 'export OLLAMA_MODELS=/nfs/roberts/project/hpcllm/shared/.ollama' >> ~/.bashrc

Make a directory to store the workshop notebook:

mkdir ~/ycrc_llm_workshop

Copy the workshop notebook:

cp /apps/data/training/hpc_llm/ollama.ipynb ~/ycrc_llm_workshop

8 of 64

Running Ollama

  • Request a compute node with 10 GB of CPU memory:
    $ salloc -p education --mem=10G
  • Load the YCRC-installed Ollama:
    $ module load ollama
  • Launch the server that manages Ollama models (the & runs the server in the background to avoid needing a new terminal; hit enter to return to the terminal):
    $ ollama serve &
  • See available models (llama3.3:70b, llama3.1:8b):
    $ ollama list
  • Run a model:
    $ ollama run llama3.1:8b

9 of 64

Running Ollama

Enter a prompt: “Why is the sky blue?”

Cancel a prompt: CTRL-C

Exit the model: >>> /bye

Exit the compute node: $ exit

10 of 64

LLMs


CPU: 1 CPU = 1 core

GPU: 1 GPU = 1000s of cores

11 of 64

Running Ollama – with a GPU!

  • Request a compute node with a GPU:
    $ salloc -p education_gpu --gpus=1
  • Check the GPU that was assigned:
    $ nvidia-smi
  • Load Ollama:
    $ module load ollama
  • Launch the server in the background (hit enter to return to the terminal):
    $ ollama serve &
  • See available models:
    $ ollama list
  • Run a model:
    $ ollama run llama3.1:8b

12 of 64

Running Ollama – with a GPU!

Enter a prompt: >>> Why is the sky blue?

Cancel the prompt: CTRL-C

Exit the Ollama model: >>> /bye

Now run a different model: $ ollama run llama3.3:70b

Enter a prompt: >>> Why is the sky blue?

Exit the model: >>> /bye

13 of 64


LLM memory requirements?

YCRC GPU Availability

14 of 64

Hardware Memory Limits (LLMs)

CPU Memory (RAM)

  • SLURM controlled: --mem, --mem-per-cpu
  • Maximum: 500 – 1000 GB
  • If overloaded: out-of-memory (OOM) errors and immediate failure

GPU Memory (vRAM)

  • Automatically given: the maximum of the requested GPU
  • Maximum (1 GPU): 11 – 141 GB
  • If overloaded: overflows into CPU memory (slower)

15 of 64

Knowing your LLM’s vRAM requirements

Mathematical way

  • Estimate
  • Quick calculation
  • Requires knowing defaults

Literate way

  • More accurate
  • Dependent on documentation


16 of 64

Calculating LLM size

  • Estimated vRAM (GB) ≈ # of parameters (in billions) × precision (in bits) / 8

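As a quick sanity check, this estimate can be written as a few lines of Python. This is a minimal sketch (the function name is illustrative, not part of the workshop materials); note that the formula underestimates real usage somewhat, e.g. llama3.3:70b at 4 bit is estimated at 35 GB but actually needs about 43 GB, so leave headroom.

# Rough vRAM estimate: parameters (billions) x precision (bits) / 8
def estimate_vram_gb(params_billions, precision_bits):
    return params_billions * precision_bits / 8

print(estimate_vram_gb(70, 4))    # llama3.3:70b at 4 bit  -> 35.0 GB (real use ~43 GB)
print(estimate_vram_gb(405, 8))   # llama 405b at 8 bit    -> 405.0 GB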

17 of 64

Parameters

# of trained weights in the LLM

  • Controls number of calculations involved when generating output


Pros:

  • Greater generality
  • Better at complex inference
  • More knowledge

Cons:

  • Takes longer to run
  • Requires stronger GPUs
    • Longer queue times
  • Requires more vRAM

YCRC Recommendation:

Start small and scale up

How to find it? 🡪 It is in the model name! 🡪 ModelName_ModelVersion:70b (e.g., the 8b in llama3.1:8b)

18 of 64

Precision

# of bits used to store the value of each parameter

  • FP32
  • FP16
  • INT8
  • INT4

A higher precision (FP32) improves the accuracy of the model at the cost of performance and memory.

Example model size at different precisions:

  • 32 bit: 100 GB vRAM
  • 16 bit: 50 GB vRAM
  • 8 bit: 25 GB vRAM
  • 4 bit: 12.5 GB vRAM

Fine-tuning: requires higher precision to ensure values aren’t lost when retraining models.

RAG/Inference: low precision performs well, usually with no visible loss in accuracy. Full FP32 precision is almost never needed.

Why it matters: lower precision means faster access to resources, more GPU flexibility, and faster research progression.

19 of 64

Precision – How do I know for my model?

Defaults:


Ollama model website: https://ollama.com/library

  • INT4 (inference/RAG models)
  • Can find different precisions by clicking model

Huggingface model website: https://huggingface.co/models

  • FP32
  • Controlled by dtype=Auto
    • Scales based on GPU
    • YCRC recommendation: choose own precision
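For Hugging Face models, choosing your own precision typically means passing an explicit dtype when loading the model rather than relying on the automatic default. A minimal sketch, assuming the transformers and torch packages are installed; the model name is illustrative only:

import torch
from transformers import AutoModelForCausalLM

# Load in 16-bit instead of letting the dtype be chosen automatically
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",     # illustrative model name
    torch_dtype=torch.float16,     # explicit precision instead of "auto"
    device_map="auto",             # place weights on the available GPU(s); requires accelerate
)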

20 of 64

Poll Everywhere multiple choice poll activity. Activity Title: How much vRAM do I need to load Llama 405b at 8 bit precision?


21 of 64

Answer with formula

vRAM (GB) ≈ # of parameters (in billions) × precision (in bits) / 8

Llama 405b at 8 bit precision: 405 × 8 / 8 = 405 GB

Llama3.3:70b at 4 bit precision: 70 × 4 / 8 = 35 GB

(Precision pulled from the ollama model library; the default is 4 bit.)

22 of 64

Knowing your LLM’s vRAM requirements

Mathematical way

  • Estimate
  • Quick calculation
  • Requires knowing defaults

Literate way

  • More accurate
  • Dependent on documentation


23 of 64


Llama3.3

https://ollama.com/library

24 of 64


25 of 64


Command to run model with ollama

vRAM requirements

Formula only provides an estimate: 70b at 4 bit precision = 35 GB, real cost = 43 GB

To see other versions of model, including other precisions

26 of 64


5 bit precision

6 bit precision

8 bit precision

16 bit precision

To run a model, click the model:

27 of 64


28 of 64

Poll Everywhere multiple choice poll activity. Activity Title: What cluster(s) are capable of running the fp8 405b (405 GB) model?


29 of 64


LLM memory requirements?

YCRC GPU Availability

 

https://ollama.com/library

  • Size of models
  • Different precisions of models
  • How to run models

Inference/RAG = 4 bit/8bit

Fine-tuning = 16 bit

Increasing values = increased memory

30 of 64


31 of 64

YCRC GPUs

Cluster   | GPU types | 7b  | 70b | 405b (4 bit) | 405b (8 bit)
McCleary  | 3         | 96  | 39  | 3            | 0
Grace     | 7         | 152 | 51  | 4            | 0
Bouchet   | 2         | 120 | 100 | 40           | 20

An LLM’s vRAM requirements can be spread across multiple GPUs on a single node (--nodes=1 or -N 1), e.g., a standard GPU node or an H200 node.

32 of 64

Selecting YCRC GPUs


#SBATCH --constraint=a100 or -C a100

33 of 64


Interactive

Submission

34 of 64


Need: GPU mem > 16 GB 🡪 --constraint="a5000|a100"

35 of 64


LLM memory requirements?

YCRC GPU Availability

 

https://ollama.com/library

  • Size of models
  • Different precisions of models
  • How to run models

Inference/RAG = 4 bit/8bit

Fine-tuning = 16 bit

Increasing values = increased memory

Cluster   | GPU types | 7b  | 70b | 405b (4 bit) | 405b (8 bit)
McCleary  | 3         | 96  | 39  | 3            | 0
Grace     | 7         | 152 | 51  | 4            | 0
Bouchet   | 2         | 120 | 100 | 40           | 20

  1. Small LLMs (< 7b parameters): any cluster
  2. Medium LLMs (70b parameters at 4 bit): Grace/Bouchet
  3. Anything larger: Bouchet

Ollama can use multiple GPUs on the same node.

Can request specific GPUs with the -C flag:

salloc -p gpu_devel --gpus=2 -C "a100|v100"

36 of 64

Ollama python interface

Feature             | Command-line  | With Python    | Benefit
Generation settings | Basic         | Full control   | Reproducibility / creativeness
Document input      | None          | Any document   | Enhanced inference
Tool integration    | None          | Python libraries | Data analysis / embedding
RAG/Fine-tuning     | N/A           | Yes            | Performance
Output              | Basic         | Modifiable     | JSON/txt/xls/etc
Data visualization  | None          | Plotting       |
Scalability         | Single prompt | Batch processing | Efficient research
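The batch-processing row is where the Python interface pays off: you can loop over many prompts in one script instead of typing them interactively. A minimal sketch, assuming the ollama Python package is installed, the Ollama server is running, and llama3.1:8b has been pulled; the prompts and output filename are made up for illustration:

import json
import ollama

prompts = [
    "Why is the sky blue?",
    "Define entropy and its role in thermodynamics",
]

results = []
for p in prompts:
    resp = ollama.chat(model="llama3.1:8b",
                       messages=[{"role": "user", "content": p}])
    results.append({"prompt": p, "answer": resp["message"]["content"]})

# Save every prompt/response pair for later analysis
with open("llm_batch_results.json", "w") as f:
    json.dump(results, f, indent=2)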

37 of 64

Ollama in Jupyter/Python

Already done – don’t need to follow!

Recipe:


Compute node!

Can install other packages (pandas, etc) as needed
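Once the notebook environment is set up, talking to a model from Python only takes a few lines. A minimal sketch, assuming the ollama Python package is installed in the environment, the Ollama server is already running, and llama3.1:8b is available:

import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])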

38 of 64


Course OnDemand Access:

39 of 64


education_gpu

40 of 64


41 of 64


42 of 64


43 of 64


Necessary step!

Launches the server that allows Ollama to load models and submit inputs to LLMs.

sleep 2 forces the notebook to pause for 2 seconds to ensure the server finishes launching.
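In the notebook this step usually amounts to starting the server as a background process and then pausing briefly. A minimal sketch of what such a cell might do (the exact cell in the workshop notebook may differ):

import subprocess, time

# Start the Ollama server in the background so later cells can talk to it
# (assumes ollama is on the PATH, e.g. after "module load ollama")
server = subprocess.Popen(["ollama", "serve"],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)

# Pause for 2 seconds so the server finishes launching before we send requests
time.sleep(2)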

44 of 64


Generation Settings

Control the quality, consistency, and style of the output provided by the LLM

45 of 64

LLM Generation Settings

Generation Setting | Value              | Effect
Temperature        | 0.01 – 1 – 2+      | Tone / creativity / conclusion-drawing
Top_p              | 0 – 1              | Word variety
Num_ctx            | 1000 – upper limit | Memory of LLM
Num_predict        | 100+               | Maximum length of response
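With the Python interface, these generation settings are passed through the options argument. A minimal sketch, assuming the ollama Python package and a running server; the particular values and prompt are only examples:

import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Design an experiment to test plant growth."}],
    options={
        "temperature": 0.3,   # lower = more deterministic/reproducible
        "top_p": 0.9,         # restrict word variety
        "num_ctx": 4096,      # tokens the model remembers (context window)
        "num_predict": 200,   # maximum length of the response
    },
)
print(response["message"]["content"])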

46 of 64

Temperature

Temperature Value | Behavior                        | Applications
0.1               | Deterministic, factual, robotic | Coding, summarization, reproducible
0.4 – 0.8         | Balanced, human-like            | Writing
0.9 – 1.5         | Creative, exploratory           | Brainstorming, ideation
1.6 – 2.0         | Chaotic, artistic               | Generation experiments
> 2.0             | Incoherent/random               | Rarely useful

47 of 64

Temperature

Temperature Value | Behavior                        | Applications
0.1               | Deterministic, factual, robotic | Coding, summarization, reproducible
0.4 – 0.8         | Balanced, human-like            | Writing
0.9 – 1.5         | Creative, exploratory           | Brainstorming, ideation
1.6 – 2.0         | Chaotic, artistic               | Generation experiments
> 2.0             | Incoherent/random               | Rarely useful

Essentially, decreasing temperature improves reproducibility but reduces creativity.

Try it yourself! Modify temperature in the Jupyter notebook and see how the experiment design changes.
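One way to try this is to loop over a few temperatures and compare the answers side by side. A minimal sketch (same assumptions as the earlier Python examples; the prompt is illustrative):

import ollama

prompt = "Design an experiment to test how light affects plant growth."
for temp in [0.1, 0.8, 1.5]:
    resp = ollama.chat(model="llama3.1:8b",
                       messages=[{"role": "user", "content": prompt}],
                       options={"temperature": temp})
    print(f"--- temperature={temp} ---")
    print(resp["message"]["content"][:300])   # first 300 characters of each answer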

48 of 64

LLM Generation Settings

Generation Setting | Value              | Effect
Temperature        | 0.01 – 1 – 2+      | Tone / creativity / conclusion-drawing
Top_p              | 0 – 1              | Word variety
Num_ctx            | 1000 – upper limit | Memory of LLM
Num_predict        | 100+               | Maximum length of response

49 of 64

Top_p

Controls the number of possible words considered during text generation.

Example: for a given prompt, the candidate next words are Dog 0.6, Cat 0.3, Mouse 0.1 🡪 output

50 of 64

Top_p

Controls the number of possible words considered during text generation.

Example: Dog 0.6, Cat 0.3, Mouse 0.1. With top_p = 1.0, the output can be Dog, Cat, or Mouse.

51 of 64

Top_p

Controls the number of possible words considered during text generation.

Example: Dog 0.6, Cat 0.3, Mouse 0.1. With top_p = 0.9, the output can be Dog or Cat.

52 of 64

Top_p

Controls the number of possible words considered during text generation.

Example: Dog 0.6, Cat 0.3, Mouse 0.1. With top_p = 0.5, the output can only be Dog.

Set temperature back to 1.0 (changes aren’t obvious at low temperatures).

Try it yourself! Modify top_p in the Jupyter notebook and see how the experiment design changes.

53 of 64

LLM Generation Settings

Generation Setting | Value              | Effect
Temperature        | 0.01 – 1 – 2+      | Tone / creativity / conclusion-drawing
Top_p              | 0 – 1              | Word variety
Num_ctx            | 1000 – upper limit | Memory of LLM
Num_predict        | 100+               | Maximum length of response

54 of 64

Num_ctx

Number of tokens remembered by the LLM (its context window).

Maximum value: found on the ollama webpage for the model.

55 of 64

LLM Generation Settings

Generation Setting | Value              | Effect
Temperature        | 0.01 – 1 – 2+      | Tone / creativity / conclusion-drawing
Top_p              | 0 – 1              | Word variety
Num_ctx            | 1000 – upper limit | Memory of LLM
Num_predict        | 100+               | Maximum length of response

56 of 64

Reproducibility and Modelfiles


Reproducibility!

57 of 64

Reproducibility and Modelfiles

Create a Modelfile with a text editor:

$ vim Modelfile

Modelfile contents:

FROM llama3.1:8b            # original model
PARAMETER temperature 0.3   # new settings
PARAMETER top_p 0.9

Create the modified model:

$ ollama create my_model_<netid> -f Modelfile
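Once ollama create finishes, the saved model can be called by name from the command line or from Python, which is what makes the settings reproducible across sessions. A minimal sketch (replace <netid> with your netid; assumes the ollama Python package and a running server):

import ollama

resp = ollama.chat(
    model="my_model_<netid>",   # the model created from the Modelfile above
    messages=[{"role": "user", "content": "Define entropy and its role in thermodynamics"}],
)
print(resp["message"]["content"])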

58 of 64

Exercise 3: Create and test a Modelfile

Step 1: Modify generation settings to answer a prompt

  1. Choose a prompt:
     a. Scientific response – Evaluate the strengths and weaknesses of using machine learning to predict disease risk
     b. Ideation – Propose innovative ways to reduce plastic pollution in the ocean
     c. Succinct response – Define entropy and its role in thermodynamics
  2. Modify settings to get an answer you like

Step 2: Save your parameters in a Modelfile

  1. Create a file, Modelfile, using the text editor of your choice and input the parameters
  2. Use the ollama create command to create the modified LLM

Step 3: Run your saved model in Jupyter and test it on all three prompts above. Compare how tone, creativity, and focus change.

59 of 64


60 of 64

Key Points: From GPUs to Reproducible LLM Research


61 of 64

Future AI/ML Developments for YCRC

New GPUs!!! – B200s

Localized Server to run inference on YCRC systems

Planned Training

  1. Workshop on RAG and fine-tuning
  2. Workshop on running neural networks
  3. Workshop on running machine learning models

Planned documentation:

  1. Jupyter notebook templates to run basic versions of AI/ML models
  2. Jupyter notebooks to understand loading single vs. multiple GPUs
  3. Expansion of current LLM documentation


62 of 64

External Resources

ACCESS

NAIRR

  • NSF funded allocations for AI Research and education
  • Includes cloud credits and API credits
  • https://nairrpilot.org/


63 of 64

Contacting YCRC

Any questions or issues about research?

  • reach out to HPC@Yale.edu


64 of 64

Questions (about LLMs or anything else)
