1 of 64

LLMs on HPC

Target Audience: HPC users who have completed Intro-to-HPC and are interested in running inference-based LLM methods on YCRC systems

Workshop content: This workshop will provide researchers interested in LLMs with the skills to launch and run inference-based workflows with open-source models on YCRC systems. Specifically, attendees will learn:

  • What GPU resources are available on each YCRC system
  • How to identify which GPU is needed for different LLMs
  • How to launch an LLM on YCRC systems
  • How to conduct inference via direct prompting with an LLM using Ollama
  • How to modify an LLM’s parameters for reproducibility/consistency in responses
  • How to implement an LLM within Python using Jupyter and Ollama
  • Additional considerations for RAG and fine-tuning


2 of 64

LLMs on YCRC Research Computing Systems

MICHAEL ROTHBERG, PHD
YALE CENTER FOR RESEARCH COMPUTING



3 of 64

Why YCRC HPC?


  • No cost for researchers to use: reduce spending from $10,000/month to $0!
  • Need extra storage or priority compute? At least 10x cheaper than commercial providers
  • Data privacy
  • Flexibility: 1,000,000+ free, downloadable models; parameter tuning, RAG, fine-tuning, etc.; model comparisons in the same interface

4 of 64

  • McCleary: 96 GPUs
  • Grace: 152 GPUs
  • Bouchet: 120 GPUs
  • Milgram (high-risk data): 20 GPUs
  • Hopper (medium-risk data): 172 GPUs

5 of 64


Course OnDemand Access:

Logged in as hpcllm_<netid>

6 of 64


7 of 64

Running Ollama

Give Ollama access to a shared model directory:

echo 'export OLLAMA_MODELS=/nfs/roberts/project/hpcllm/shared/.ollama' >> ~/.bashrc

Make a directory to store the workshop notebook:

mkdir ~/ycrc_llm_workshop

Copy the workshop notebook:

cp /apps/data/training/hpc_llm/ollama.ipynb ~/ycrc_llm_workshop

8 of 64

Running Ollama

  • Request a compute node with 10 GB of CPU memory:
    $ salloc -p education --mem=10G
  • Load the YCRC-installed Ollama:
    $ module load ollama
  • Launch the server that manages Ollama models (the & runs the server in the background to avoid needing a new terminal; hit enter to return to the terminal):
    $ ollama serve &
  • See available models (llama3.3:70b, llama3.1:8b):
    $ ollama list
  • Run a model:
    $ ollama run llama3.1:8b

9 of 64

Running Ollama

Enter a prompt: “Why is the sky blue?”

Cancel a prompt: CTRL-C

Exit the model: >>> /bye

Exit the compute node: $ exit

10 of 64

LLMs


CPU: 1 CPU = 1 core

GPU: 1 GPU = 1000s of cores

11 of 64

Running Ollama – with a GPU!

  • Request a compute node with a GPU:
    $ salloc -p education_gpu --gpus=1
  • Check the GPU that was assigned:
    $ nvidia-smi
  • Load Ollama:
    $ module load ollama
  • Launch the server in the background (hit enter to return to the terminal):
    $ ollama serve &
  • See available models:
    $ ollama list
  • Run a model:
    $ ollama run llama3.1:8b

12 of 64

Running Ollama – with a GPU!

Enter a prompt: >>> Why is the sky blue?

Cancel the prompt: CTRL-C

Exit the Ollama model: >>> /bye

Now run a different model: $ ollama run llama3.3:70b

Enter a prompt: >>> Why is the sky blue?

Exit the model: >>> /bye

13 of 64


LLM memory requirements?

YCRC GPU Availability

14 of 64

Hardware Memory Limits (LLMs)

CPU Memory (RAM)

  • SLURM controlled: --mem, --mem-per-cpu
  • Maximum: 500 – 1000 GB
  • If overloaded: out-of-memory (OOM) errors and immediate failure

GPU Memory (vRAM)

  • Automatically given: the maximum of the requested GPU
  • Maximum (1 GPU): 11 – 141 GB
  • If overloaded: overflows into CPU memory (slower)

15 of 64

Knowing your LLM’s vRAM requirements

Mathematical way

  • Estimate
  • Quick calculation
  • Requires knowing defaults

Literate way

  • More accurate
  • Dependent on documentation


16 of 64

Calculating LLM size

  • Estimated vRAM (GB) ≈ # of parameters (in billions) × precision (in bits) / 8

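As a quick sanity check, this estimate can be written as a few lines of Python. This is a minimal sketch (the function name is illustrative, not part of the workshop materials); note that the formula underestimates real usage somewhat, e.g. llama3.3:70b at 4 bit is estimated at 35 GB but actually needs about 43 GB, so leave headroom.

# Rough vRAM estimate: parameters (billions) x precision (bits) / 8
def estimate_vram_gb(params_billions, precision_bits):
    return params_billions * precision_bits / 8

print(estimate_vram_gb(70, 4))    # llama3.3:70b at 4 bit  -> 35.0 GB (real use ~43 GB)
print(estimate_vram_gb(405, 8))   # llama 405b at 8 bit    -> 405.0 GB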

17 of 64

Parameters

# of trained weights in the LLM

  • Controls number of calculations involved when generating output


Pros:

  • Greater generality
  • Better at complex inference
  • More knowledge

Cons:

  • Takes longer to run
  • Requires stronger GPUs
    • Longer queue times
  • Requires more vRAM

YCRC Recommendation:

Start small and scale up

How to find it? 🡪 It is in the model name! 🡪 ModelName_ModelVersion:70b (e.g., the 8b in llama3.1:8b)

18 of 64

Precision

# of bits used to store the value of each parameter

  • FP32
  • FP16
  • INT8
  • INT4

A higher precision (FP32) improves the accuracy of the model at the cost of performance and memory.

Example model size at different precisions:

  • 32 bit: 100 GB vRAM
  • 16 bit: 50 GB vRAM
  • 8 bit: 25 GB vRAM
  • 4 bit: 12.5 GB vRAM

Fine-tuning: requires higher precision to ensure values aren’t lost when retraining models.

RAG/Inference: low precision performs well, usually with no visible loss in accuracy. Full FP32 precision is almost never needed.

Why it matters: lower precision means faster access to resources, more GPU flexibility, and faster research progression.

19 of 64

Precision – How do I know for my model?

Defaults:


Ollama model website: https://ollama.com/library

  • INT4 (inference/RAG models)
  • Can find different precisions by clicking model

Huggingface model website: https://huggingface.co/models

  • FP32
  • Controlled by dtype=Auto
    • Scales based on GPU
    • YCRC recommendation: choose own precision
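For Hugging Face models, choosing your own precision typically means passing an explicit dtype when loading the model rather than relying on the automatic default. A minimal sketch, assuming the transformers and torch packages are installed; the model name is illustrative only:

import torch
from transformers import AutoModelForCausalLM

# Load in 16-bit instead of letting the dtype be chosen automatically
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",     # illustrative model name
    torch_dtype=torch.float16,     # explicit precision instead of "auto"
    device_map="auto",             # place weights on the available GPU(s); requires accelerate
)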

20 of 64

Poll Everywhere multiple choice poll activity. Activity Title: How much vRAM do I need to load Llama 405b at 8 bit precision?


21 of 64

Answer with formula

vRAM (GB) ≈ # of parameters (in billions) × precision (in bits) / 8

Llama 405b at 8 bit precision: 405 × 8 / 8 = 405 GB

Llama3.3:70b at 4 bit precision: 70 × 4 / 8 = 35 GB

(Precision pulled from the ollama model library; the default is 4 bit.)

22 of 64

Knowing your LLM’s vRAM requirements

Mathematical way

  • Estimate
  • Quick calculation
  • Requires knowing defaults

Literate way

  • More accurate
  • Dependent on documentation


23 of 64


Llama3.3

https://ollama.com/library

24 of 64


25 of 64


Command to run model with ollama

vRAM requirements

Formula only provides an estimate: 70b at 4 bit precision = 35 GB, real cost = 43 GB

To see other versions of model, including other precisions

26 of 64


5 bit precision

6 bit precision

8 bit precision

16 bit precision

To run a model, click the model:

27 of 64


28 of 64

Poll Everywhere multiple choice poll activity. Activity Title: What cluster(s) are capable of running the fp8 405b (405 GB) model?


29 of 64


LLM memory requirements?

YCRC GPU Availability

 

https://ollama.com/library

  • Size of models
  • Different precisions of models
  • How to run models

Inference/RAG = 4 bit/8bit

Fine-tuning = 16 bit

Increasing values = increased memory

30 of 64


31 of 64

YCRC GPUs

Cluster   | GPU types | 7b  | 70b | 405b (4 bit) | 405b (8 bit)
McCleary  | 3         | 96  | 39  | 3            | 0
Grace     | 7         | 152 | 51  | 4            | 0
Bouchet   | 2         | 120 | 100 | 40           | 20

An LLM’s vRAM requirements can be spread across multiple GPUs on a single node (--nodes=1 or -N 1), e.g., a standard GPU node or an H200 node.

32 of 64

Selecting YCRC GPUs


#SBATCH --constraint=a100 or -C a100

33 of 64


Interactive

Submission

34 of 64


Need: GPU mem > 16 GB 🡪 --constraint="a5000|a100"

35 of 64


LLM memory requirements?

YCRC GPU Availability

 

https://ollama.com/library

  • Size of models
  • Different precisions of models
  • How to run models

Inference/RAG = 4 bit/8bit

Fine-tuning = 16 bit

Increasing values = increased memory

Cluster   | GPU types | 7b  | 70b | 405b (4 bit) | 405b (8 bit)
McCleary  | 3         | 96  | 39  | 3            | 0
Grace     | 7         | 152 | 51  | 4            | 0
Bouchet   | 2         | 120 | 100 | 40           | 20

  1. Small LLMs (< 7b parameters): any cluster
  2. Medium LLMs (70b parameters at 4 bit): Grace/Bouchet
  3. Anything larger: Bouchet

Ollama can use multiple GPUs on the same node.

Can request specific GPUs with the -C flag:

salloc -p gpu_devel --gpus=2 -C "a100|v100"

36 of 64

Ollama python interface

Feature             | Command-line  | With Python    | Benefit
Generation settings | Basic         | Full control   | Reproducibility / creativeness
Document input      | None          | Any document   | Enhanced inference
Tool integration    | None          | Python libraries | Data analysis / embedding
RAG/Fine-tuning     | N/A           | Yes            | Performance
Output              | Basic         | Modifiable     | JSON/txt/xls/etc
Data visualization  | None          | Plotting       |
Scalability         | Single prompt | Batch processing | Efficient research
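The batch-processing row is where the Python interface pays off: you can loop over many prompts in one script instead of typing them interactively. A minimal sketch, assuming the ollama Python package is installed, the Ollama server is running, and llama3.1:8b has been pulled; the prompts and output filename are made up for illustration:

import json
import ollama

prompts = [
    "Why is the sky blue?",
    "Define entropy and its role in thermodynamics",
]

results = []
for p in prompts:
    resp = ollama.chat(model="llama3.1:8b",
                       messages=[{"role": "user", "content": p}])
    results.append({"prompt": p, "answer": resp["message"]["content"]})

# Save every prompt/response pair for later analysis
with open("llm_batch_results.json", "w") as f:
    json.dump(results, f, indent=2)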

37 of 64

Ollama in Jupyter/Python

Already done – don’t need to follow!

Recipe:


Compute node!

Can install other packages (pandas, etc) as needed
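Once the notebook environment is set up, talking to a model from Python only takes a few lines. A minimal sketch, assuming the ollama Python package is installed in the environment, the Ollama server is already running, and llama3.1:8b is available:

import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])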

38 of 64


Course OnDemand Access:

39 of 64


education_gpu

40 of 64


41 of 64


42 of 64


43 of 64


Necessary step!

Launches the server that allows Ollama to load models and submit inputs to LLMs.

sleep 2 forces the notebook to pause for 2 seconds to ensure the server finishes launching.
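In the notebook this step usually amounts to starting the server as a background process and then pausing briefly. A minimal sketch of what such a cell might do (the exact cell in the workshop notebook may differ):

import subprocess, time

# Start the Ollama server in the background so later cells can talk to it
# (assumes ollama is on the PATH, e.g. after "module load ollama")
server = subprocess.Popen(["ollama", "serve"],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)

# Pause for 2 seconds so the server finishes launching before we send requests
time.sleep(2)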

44 of 64


Generation Settings

Control the quality, consistency, and style of the output provided by the LLM

45 of 64

LLM Generation Settings

Generation Setting | Value              | Effect
Temperature        | 0.01 – 1 – 2+      | Tone / creativity / conclusion-drawing
Top_p              | 0 – 1              | Word variety
Num_ctx            | 1000 – upper limit | Memory of LLM
Num_predict        | 100+               | Maximum length of response
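With the Python interface, these generation settings are passed through the options argument. A minimal sketch, assuming the ollama Python package and a running server; the particular values and prompt are only examples:

import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Design an experiment to test plant growth."}],
    options={
        "temperature": 0.3,   # lower = more deterministic/reproducible
        "top_p": 0.9,         # restrict word variety
        "num_ctx": 4096,      # tokens the model remembers (context window)
        "num_predict": 200,   # maximum length of the response
    },
)
print(response["message"]["content"])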

46 of 64

Temperature

Temperature Value | Behavior                        | Applications
0.1               | Deterministic, factual, robotic | Coding, summarization, reproducible
0.4 – 0.8         | Balanced, human-like            | Writing
0.9 – 1.5         | Creative, exploratory           | Brainstorming, ideation
1.6 – 2.0         | Chaotic, artistic               | Generation experiments
> 2.0             | Incoherent/random               | Rarely useful

47 of 64

Temperature

Temperature Value | Behavior                        | Applications
0.1               | Deterministic, factual, robotic | Coding, summarization, reproducible
0.4 – 0.8         | Balanced, human-like            | Writing
0.9 – 1.5         | Creative, exploratory           | Brainstorming, ideation
1.6 – 2.0         | Chaotic, artistic               | Generation experiments
> 2.0             | Incoherent/random               | Rarely useful

Essentially, decreasing temperature improves reproducibility but reduces creativity.

Try it yourself! Modify temperature in the Jupyter notebook and see how the experiment design changes.
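One way to try this is to loop over a few temperatures and compare the answers side by side. A minimal sketch (same assumptions as the earlier Python examples; the prompt is illustrative):

import ollama

prompt = "Design an experiment to test how light affects plant growth."
for temp in [0.1, 0.8, 1.5]:
    resp = ollama.chat(model="llama3.1:8b",
                       messages=[{"role": "user", "content": prompt}],
                       options={"temperature": temp})
    print(f"--- temperature={temp} ---")
    print(resp["message"]["content"][:300])   # first 300 characters of each answer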

48 of 64

LLM Generation Settings

Generation Setting | Value              | Effect
Temperature        | 0.01 – 1 – 2+      | Tone / creativity / conclusion-drawing
Top_p              | 0 – 1              | Word variety
Num_ctx            | 1000 – upper limit | Memory of LLM
Num_predict        | 100+               | Maximum length of response

49 of 64

Top_p

Controls the number of possible words considered during text generation.

Example: for a given prompt, the candidate next words are Dog 0.6, Cat 0.3, Mouse 0.1 🡪 output

50 of 64

Top_p

Controls the number of possible words considered during text generation.

Example: Dog 0.6, Cat 0.3, Mouse 0.1. With top_p = 1.0, the output can be Dog, Cat, or Mouse.

51 of 64

Top_p

Controls the number of possible words considered during text generation.

Example: Dog 0.6, Cat 0.3, Mouse 0.1. With top_p = 0.9, the output can be Dog or Cat.

52 of 64

Top_p

Controls the number of possible words considered during text generation.

Example: Dog 0.6, Cat 0.3, Mouse 0.1. With top_p = 0.5, the output can only be Dog.

Set temperature back to 1.0 (changes aren’t obvious at low temperatures).

Try it yourself! Modify top_p in the Jupyter notebook and see how the experiment design changes.

53 of 64

LLM Generation Settings

Generation Setting | Value              | Effect
Temperature        | 0.01 – 1 – 2+      | Tone / creativity / conclusion-drawing
Top_p              | 0 – 1              | Word variety
Num_ctx            | 1000 – upper limit | Memory of LLM
Num_predict        | 100+               | Maximum length of response

54 of 64

Num_ctx

Number of tokens remembered by the LLM (its context window).

Maximum value: found on the ollama webpage for the model.

55 of 64

LLM Generation Settings

Generation Setting | Value              | Effect
Temperature        | 0.01 – 1 – 2+      | Tone / creativity / conclusion-drawing
Top_p              | 0 – 1              | Word variety
Num_ctx            | 1000 – upper limit | Memory of LLM
Num_predict        | 100+               | Maximum length of response

56 of 64

Reproducibility and Modelfiles


Reproducibility!

57 of 64

Reproducibility and Modelfiles

Create a Modelfile with a text editor:

$ vim Modelfile

Modelfile contents:

FROM llama3.1:8b            # original model
PARAMETER temperature 0.3   # new settings
PARAMETER top_p 0.9

Create the modified model:

$ ollama create my_model_<netid> -f Modelfile
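Once ollama create finishes, the saved model can be called by name from the command line or from Python, which is what makes the settings reproducible across sessions. A minimal sketch (replace <netid> with your netid; assumes the ollama Python package and a running server):

import ollama

resp = ollama.chat(
    model="my_model_<netid>",   # the model created from the Modelfile above
    messages=[{"role": "user", "content": "Define entropy and its role in thermodynamics"}],
)
print(resp["message"]["content"])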

58 of 64

Exercise 3: Create and test a Modelfile

Step 1: Modify generation settings to answer a prompt

  1. Choose a prompt:
     a. Scientific response – Evaluate the strengths and weaknesses of using machine learning to predict disease risk
     b. Ideation – Propose innovative ways to reduce plastic pollution in the ocean
     c. Succinct response – Define entropy and its role in thermodynamics
  2. Modify settings to get an answer you like

Step 2: Save your parameters in a Modelfile

  1. Create a file, Modelfile, using the text editor of your choice and input the parameters
  2. Use the ollama create command to create the modified LLM

Step 3: Run your saved model in Jupyter and test it on all three prompts above. Compare how tone, creativity, and focus change.

59 of 64


60 of 64

Key Points: From GPUs to Reproducible LLM Research


61 of 64

Future AI/ML Developments for YCRC

New GPUs!!! – B200s

Localized Server to run inference on YCRC systems

Planned Training

  1. Workshop on RAG and fine-tuning
  2. Workshop on running neural networks
  3. Workshop on running machine learning models

Planned documentation:

  1. Jupyter notebook templates to run basic versions of AI/ML models
  2. Jupyter notebooks to understand loading single vs. multiple GPUs
  3. Expansion of current LLM documentation


62 of 64

External Resources

ACCESS

NAIRR

  • NSF funded allocations for AI Research and education
  • Includes cloud credits and API credits
  • https://nairrpilot.org/


63 of 64

Contacting YCRC

Any questions or issues about research?

  • reach out to HPC@Yale.edu


64 of 64

Questions (about LLMs or anything else)
