1 of 30

Deep Learning at MSI

Part II

Ham Lam and Mo Myat

Spring 2025

Minnesota Supercomputing Institute

2 of 30

AI models/tools
Hardware (GPUs)
Deploy AI models on Agate
RAG

Agenda

Prerequisite

Level: Intermediate

Linux/BASH
Software install
Python

Deep Learning II at MSI

Minnesota Supercomputing Institute

3 of 30

Goal: What you will learn

Generative AI Tools & Agate Integration
Deploy an AI model for inference on Agate
A RAG application on Agate

Minnesota Supercomputing Institute

4 of 30

Generative AI Models

The science of creating NEW content from learned patterns

→ Text, Images, Video, Code, etc

Minnesota Supercomputing Institute

5 of 30

Generative AI Models

New AI models are released constantly with 3 common increases:

Increased capability (e.g. multimodal)
Increased complexity (e.g. architecture and size)
Increased computational demand (e.g. GPUs)

* DeepSeek R1 released on Jan, 2025

* Google released Gemma 3 on March 12th 2025

* Meta released Llama 4 on April 5th, 2025

Model card

Model Details: Brief description of model

Model Developers: Meta

Variations: Sizes (8B, 70B, etc), pretrained, instruction tuned, etc

Input: text only

Output: text and code only

Architecture: auto-regressive etc.

GPU Compute resources

Generative AI Models

Minnesota Supercomputing Institute

6 of 30

Industry AI models

Larger scale LLMs
Pricing options (including free tiers)

open-weight options are available (DeepSeek)

Demand large compute resources
Easily accessible with a web browser

user friendly !

Industry	AI Model
Google	Gemini, Gemma
Microsoft CoPilot	Utilizes ChatGPT series
OpenAI	ChatGPT 5
Meta	LlaMa series
IBM	Granite series
Amazon	Nova
ANTROPIC	Claude
DeepSeek	DeepSeek series
….	…….

UMN-Licensed AI tools: Gemini, CoPilot, NotebookLM, and Zoom AI Companion

Generative AI Models

Minnesota Supercomputing Institute

7 of 30

Community AI models

Huge model choices (see HuggingFace site) but
Models are typically smaller (model parameters)

limited capability but are fast catching up (e.g. DeepSeek, gpt-oss)

Relatively small compute resource requirement
No Pricing option No cost
Not user-friendly
standalone apps available just now!
Good for developers but not good for ‘everyday’ users
HuggingFace ecosystem

AI models, data, and software libraries download

Generative AI Models

Minnesota Supercomputing Institute

8 of 30

Standalone

> GUI application: LM Studio

> Web based: Open_WebUI

Model servers (for inference)

> Ollama (server + models)

> vLLM

Programming frameworks/Libraries etc

> Langchain, HuggingFace transformer library, etc..

> Build AI based applications

> Good for developers but not good for ‘everyday’ users

Generative AI Models

Community AI models and Tools

Minnesota Supercomputing Institute

9 of 30

GPU Compute resources

Generative AI Models: System Prompt

A system prompt (most are hidden from users) is an instruction given to an AI model to set the context, behavior, or tone for how it should respond during a conversation.

Purpose: The system prompt pre-define the AI’s personality, role, or response style.�
Where it's used: Set by developers or the platform. �
Not visible to users (usually): It’s different from the text you type — it’s in the background.

Minnesota Supercomputing Institute

10 of 30

GPU Compute resources

Generative AI Models

System Prompt: OpenAI ChatGPT

GPU Compute resources

Generative AI Models: System Prompt

Minnesota Supercomputing Institute

11 of 30

GPU Compute resources

Generative AI Models

openai/gpt-oss-20b

You are ChatGPT, a large language model trained by OpenAI.

Your task is to assist users in a friendly, clear, and concise manner while adhering to the following guidelines:

1. **Role & Tone**

- Act as an expert tutor/consultant in the user’s requested domain (e.g., coding, math, writing, travel).

- Use a conversational tone that is approachable yet professional.

- Begin each response with a brief greeting and end with an invitation for further questions.

2. **Content Constraints**

- Keep responses short: aim for 1–3 sentences per answer unless the user explicitly requests more detail.

- Avoid jargon; if technical terms are necessary, explain them briefly in plain language.

- Do not mention that you are an AI or reference your training data.

3. **Safety & Ethics**

- Refuse to provide instructions that facilitate wrongdoing (e.g., hacking, fraud).

- If a user asks for disallowed content (hate speech, explicit material), respond with a refusal and a brief apology.

- When uncertain about an answer, say “I’m not sure” and offer to try again or suggest resources.

4. **Formatting & Structure**

- Use bullet points for lists, numbered steps for processes, and code blocks for programming examples.

- For math or scientific queries, provide concise explanations; full derivations are optional unless requested.

- Keep code snippets short (≤ 20 lines) and runnable in a typical environment.

5. **Interaction Flow**

- If the user’s question is ambiguous, ask one clarifying question before providing an answer.

- Do not add unsolicited advice beyond what the user asks for.

- Always check for follow‑up needs after giving an answer (e.g., “Does that help?”).

6. **Response Quality**

- Ensure correctness; double‑check facts or code logic when possible.

- Avoid filler words (“um”, “like”) and keep sentences crisp.

- Use proper grammar, punctuation, and capitalization.

You should apply these rules to every user query during the conversation session.

End each response with a friendly prompt for further assistance, e.g., “Let me know if you’d like more details or have another question!”

GPU Compute resources

Generative AI Models: System Prompt

Minnesota Supercomputing Institute

12 of 30

Deep Learning at MSI

Generative AI models: Accuracy vs Cost

Enhancing AI models accuracy

Pre-training from scratch → Very high cost
Fine-tuning (refines pre-trained model using domain specific data) → high cost
Retrieval-Augmented Generation (RAG) → acceptable cost
Prompt Engineering (Uses prompts to use pretrain model directly) → cheap

Users put more data, more detail, and more context in the prompts!

Shifting the responsibility to users!

Minnesota Supercomputing Institute

13 of 30

Agate GPU Compute resources for AI jobs

Minnesota Supercomputing Institute

14 of 30

GPU Compute resources

AI model size vs GPU memory (for inference)

→ Larger models (e.g. 70B) require significantly more GPU memory

→ Quantization ( e.g. 8-bit or 4-bit) reduces memory demand allowing larger models to fit on the GPUs. But can impact model accuracy

→ Offload larger models to host memory ‘doable’ but drastically increases latency

Minnesota Supercomputing Institute

15 of 30

GPU Compute resources

GPU type	GPU Partition Names	Node sharing?	GPU Memory	Suitable AI model size*	Cores per node	Walltime limit	Total node memory	Local scratch	Max. nodes per job
H100	msigpu	Yes	80GB	> 70 b	128	24:00:00	768 GB	850 GB	4
A100	a100-4, msigpu	Yes	40GB	< 40 b**	64	96:00:00	499 GB	850 GB	4
A100	a100-8, msigpu	Yes	40GB	< 40 b**	128	24:00:00	1002 GB	850 GB	1
L40S	msigpu	Yes	48GB	< 10 b	128	24:00:00	768 GB	850 GB	4
A40	preempt-gpu	Yes	48GB	< 10 b	128	24:00:00	499 GB	850 GB	1
A40, L40S	interactive-gpu	Yes	48GB	< 10 b	128	24:00:00	60 GB	228 GB	1
V100	v100, msigpu	Yes	32GB	< 10 b	24	24:00:00	374 GB	859 GB	1

Agate GPU partitions

* Number of model parameters is in the billions (b)

** Model offload to multiple GPUs

Minnesota Supercomputing Institute

16 of 30

AI model and Agate

Typical AI jobs setup

Backend

→ GPU resources

→ Enough storage space for AI model weights and user data

→ LLM serving engine

Frontend

→ User interface (Web GUI or Commandline?)

→ Tasks (e.g. chat-completion, RAG, data analysis)

Deep Learning at MSI

Generative AI Models and Agate

Minnesota Supercomputing Institute

17 of 30

AI models and Agate and SLURM

Job Types

Interactive AI model inference (immediate results)

→ srun or salloc GPU backed terminals

→ Open OnDemand Interactive App

GPU: Desktop or JupyterNotebook

Batch AI model inference

→ Scripted large AI workloads

a job script, user data, AI model, and Prompts!

Minnesota Supercomputing Institute

18 of 30

Personal $HOME directory

200GB and 1 million files limit

Project space: share among group members (up to 20TB & 5 million file count)

/home/PROJECT/
/home/PROJECT/shared
/home/PROJECT/public

Scratch global (40TB & 13.2e6 file count)

/scratch.global # 30 days
local scratch (/tmp on compute nodes)

https://msi.umn.edu/our-resources/knowledge-base/new-home-directories

Staging AI jobs

Minnesota Supercomputing Institute

19 of 30

Deploy a LLM on Agate using community tools

Ollama server

Minnesota Supercomputing Institute

20 of 30

Open-source LLMs serving engine
Allow users to run LLMs locally (e.g. on Agate!)
Support many popular LLMs!

Deep Learning at MSI

Ollama: An AI model server

Copy and extract the package:

mkdir $HOME/Ollama

cd $HOME/Ollama

cp /common/tutorials/DeepLearning2/ollama-linux-amd64.tgz $HOME/Ollama

tar xvf ollama-linux-amd64.tgz

Model	Parameters	Size	Download
Gemma 3	1B	815MB	ollama run gemma3:1b
DeepSeek-R1	7B	4.7GB	ollama run deepseek-r1
Llama 3.3	70B	43GB	ollama run llama3.3
Llama 3.2	1B	1.3GB	ollama run llama3.2:1b
Llama 3.2 Vision	11B	7.9GB	ollama run llama3.2-vision

https://github.com/ollama/ollama/blob/main/docs/linux.md

Minnesota Supercomputing Institute

21 of 30

Deep Learning at MSI

Ollama: An AI model server

Ollama formatted AI models are stored in

$HOME/.ollama directory

Start the Ollama server (# prefer to use a seperate terminal)

ollama serve

Bring up help menu

ollama --help

Pull down a model from a registry

ollama pull <model_name>

e.g. ollama pull llama3.2

List models stored in your $HOME/.ollama directory

ollama list

Minnesota Supercomputing Institute

22 of 30

Retrieval Augmented Generation (RAG)

Minnesota Supercomputing Institute

23 of 30

Deep Learning at MSI

Generative AI Tools and Agate

A Large Language Model’s knowledge is limited

Data that is too new - current events, just about any content created after the LLM training data
Data that is not public - personal, internal, secret data etc.

Deep Learning at MSI

Pretrained LLMs vs RAG

User: Where are the AEDs located inside the Walter library building?

I have no idea!

RAG: A way to add your “own data” to the prompt that you pass into a LLM.

Advantages:

Data privacy and protection are significant concerns
Provides up-to-date domain specific context (your own data)
Improves accuracy of generated response by grounding them in retrieved facts
Reduces hallucinations common in standalone LLMs
Allows easy updates to the knowledge base without retraining the model

Minnesota Supercomputing Institute

24 of 30

Deep Learning at MSI

Retrieval Augmented Generation (RAG)

Question → Retriever → Large Language Model → Response

User provided CONTEXT

LLM can access up-to-date and specific information beyond its training data, making it more effective!

Minnesota Supercomputing Institute

25 of 30

Deep Learning at MSI

Retrieval Augmented Generation (RAG)

Question → Retriever → Large Language Model → Response

User provided CONTEXT

Typical flow for a RAG system is:

Prompt: A user generates a query.
Embedding Model: The prompt is converted into vectors
Vector Database Search: After a user’s prompt is embedded into a vector, the system searches a vector database filled with contextually relevant data chunks.
Reranking: The retrieved data chunks are reranked to prioritize the most relevant data.
LLM: The LLM generates responses informed by the retrieved data

Minnesota Supercomputing Institute

26 of 30

Backend	Purpose
Ollama engine	serves LLM
Ollama supported models	LLM model (llama3.2)

Frontend	Purpose
Langchain	code library
HuggingFace	Use an embedding model
Chroma vector DB	Store chuncked data
unstructured[all-docs]	Parse documents
sentence-transformers	Create embedding in high dim space
Python	rag script

Deep Learning at MSI

RAG app: Software stack

Building a RAG app on Agate

Minnesota Supercomputing Institute

27 of 30

Deep Learning at MSI

Frontend Software Stack

Deep Learning at MSI

RAG app: Frontend Software Stack Install

## Use a “Terminal” from a GPU backed interactive session

# unload all loaded modules to start with a clean env

> module purge

> module load miniforge/24.3

> mamba create -n rag_env_test # will create the environment in your $HOME/.conda/envs

# if create the environment else where

>mamba create -p /scratch.global/<user_id>/rag_env_test

# activate the environment

>source activate /scratch.global/<user_id>/rag_env_test

# install python first

>mamba install python==3.11

# use pip to install langchain packages

>pip install langchain

>pip install langchain-community

>pip install langchain_huggingface langchain_ollama

# then install 3 more packages

>pip install "unstructured[all-docs]"

>pip install sentence-transformers

>pip install chromadb

# create a jupyter kernel

>python3 -m ipykernel install --user --name rag_env_test --display-name "rag_env_test"

Minnesota Supercomputing Institute

28 of 30

Deep Learning at MSI

Frontend Software Stack

Deep Learning at MSI

RAG app

Copy the rag scripts:

mkdir $HOME/RAG

cd $HOME/RAG

cp /common/tutorials/DeepLearning2/rag.py $HOME/RAG

# jupyter notebook script

cp /common/tutorials/DeepLearning2/rag.ipynb $HOME/RAG

Minnesota Supercomputing Institute

29 of 30

Deep Learning at MSI

Frontend Software Stack

Deep Learning at MSI

# Run the rag.py script

# Start the Ollama server

$HOME/Ollama/bin/ollama serve

# On a second fresh terminal, activate the rag_env_test environment

source activate rag_env_test

cd $HOME/RAG

python rag.py

Run the RAG script

RAG pipeline implementation detail

Minnesota Supercomputing Institute

30 of 30

General MSI Help (HPC systems, OpenOnDemand, Report issues, etc)

help@msi.umn.edu

Deep Learning Tutorial

Ham Lam: lamx0031@umn.edu
Mo Myat: mo000007@umn.edu

How to contact us

Minnesota Supercomputing Institute