1 of 16

Running LLM Infra Locally: A Practical Guide

ChatGPT-like power in your living room!

Fifth Elephant 2025

2 of 16

Why self-host?

  • Control: run whichever model, whenever you want.
  • Privacy: data does not leave your machine.
  • Cost: you control the spend; often cheaper than frontier-model APIs.


3 of 16

Setup

  • Hardware
      • Radeon™ RX 7800 XT
      • 16GB VRAM
  • Software
      • Ollama
      • Tailscale

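With this setup, the model server can be reached like any other HTTP service on your tailnet. A minimal Python sketch that queries Ollama's /api/generate endpoint; the Tailscale hostname gpu-box and the mistral model name are placeholders for whatever you actually run.

```python
# Minimal sketch: query a local Ollama server's HTTP API.
# Assumptions: Ollama is running on its default port (11434), a model
# named "mistral" has been pulled, and "gpu-box" is a hypothetical
# Tailscale hostname for the machine with the GPU.
import requests

OLLAMA_URL = "http://gpu-box:11434/api/generate"  # or http://localhost:11434 on the box itself

def ask(prompt: str, model: str = "mistral") -> str:
    """Send one non-streaming generation request and return the text."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(ask("In one sentence, why self-host an LLM?"))
```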

4 of 16

Hardware


  • CPU-only (entry level):
    • 8+ core CPU (e.g., Ryzen 7, i7 12th Gen)
    • 32GB RAM minimum (models like Mistral-7B Q4 can still work)
    • SSD with 100–200 GB free (for model weights)
  • GPU-accelerated (recommended):
    • GPU with ≥12GB VRAM (e.g., RTX 3060, RX 6700 XT)
    • Models like LLaMA-3 8B Q4 or Mistral 7B Q4/Q5 run smoothly here
    • 64GB system RAM for better context support and multitasking

Rule of Thumb:

1B parameters ≈ 2GB memory (at 16-bit precision; quantization shrinks this, see the sketch below)

Constraint Note:

Quantized 7B models run on 8–12GB VRAM GPUs.

13B+ needs ≥24GB VRAM or offloading tricks (e.g., ggml+CPU fallback).
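A quick back-of-the-envelope check of these numbers. The sketch below counts weights only, so treat the results as lower bounds: KV cache and runtime buffers add a few GB on top, which is why the 13B FP16 figure pushes past 24GB cards.

```python
# Rough memory needed for model weights: params * bits-per-weight / 8.
# Weights only; KV cache, activations and runtime buffers come on top.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB."""
    return params_billion * bits_per_weight / 8  # 1e9 params cancels the /1e9 for GB

if __name__ == "__main__":
    print(f"7B  @ FP16: ~{weight_gb(7, 16):.1f} GB")   # the 1B ≈ 2GB rule of thumb
    print(f"7B  @ Q4  : ~{weight_gb(7, 4):.1f} GB")    # why quantized 7B fits 8–12GB VRAM
    print(f"13B @ Q4  : ~{weight_gb(13, 4):.1f} GB")
    print(f"13B @ FP16: ~{weight_gb(13, 16):.1f} GB")  # why 13B+ wants ≥24GB or offloading
```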

5 of 16

Inference Engines

Engine    | Backend        | Key Features
Ollama    | GGUF/llama.cpp | 1-command model serving, model fetching, basic API
LM Studio | llama.cpp      | GUI-based, ideal for testing & demos
vLLM      | CUDA           | Super fast, better batching, not quantization-friendly yet
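Because Ollama and vLLM both expose an OpenAI-compatible /v1 endpoint, the client code can stay the same while you swap engines underneath. A minimal sketch, assuming default ports (Ollama 11434, vLLM 8000) and a served model named mistral; adjust to whatever you actually run.

```python
# Minimal sketch: one client, two local OpenAI-compatible backends.
# Port numbers and the model name are assumptions based on default configs.
from openai import OpenAI

BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}

# Point the standard OpenAI client at the local engine instead of the cloud.
client = OpenAI(base_url=BACKENDS["ollama"], api_key="not-needed-locally")

reply = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Give one pro and one con of llama.cpp."}],
)
print(reply.choices[0].message.content)
```

Switching engines is then a one-line change of base_url, which makes it easy to benchmark Ollama against vLLM on the same prompts.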

6 of 16

Local LLM Tooling


7 of 16

Model Families


Model Family                     | Sizes             | Best For                           | Notes
Mistral                          | 7B, 12B, 22B, 24B | General purpose, chat, reasoning   | Fast, strong few-shot ability, open weights
Qwen (Alibaba)                   | 3B–72B            | Multilingual chat, reasoning       | Strong performance on benchmarks
LLaMA 2 / LLaMA 3                | 7B–70B            | Reasoning, dialogue, complex tasks |
Gemma 3                          | 1B–27B            | Vision                             | 128K context window, 140+ languages
Code LLMs (CodeLLaMA, StarCoder) | 7B–15B            | SQL generation, code tasks         |
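One way to compare families from this table is to pull a few tags and ask them the same question through the ollama Python client. A rough sketch; the tag names are assumptions, so check the registry (or `ollama list`) for the exact sizes and quantizations you want.

```python
# Rough sketch: try several model families on the same prompt.
# The tags "mistral", "llama3" and "gemma3" are assumed registry names;
# substitute whatever tags you actually have or want to download.
import ollama

QUESTION = "Write a one-line SQL query counting rows in a table named orders."

for tag in ["mistral", "llama3", "gemma3"]:
    ollama.pull(tag)  # downloads the model if it is not already local
    reply = ollama.chat(model=tag, messages=[{"role": "user", "content": QUESTION}])
    print(f"--- {tag} ---")
    print(reply["message"]["content"])
```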

8 of 16

Quantization


Format | Bits | Trade-offs
Q2_K   | 2    | Extreme compression, loses coherence
Q4_K   | 4    | Good balance, usable for 7B on 8GB VRAM
Q5_K   | 5    | Better output, needs ~10–12GB VRAM
Q8_0   | 8    | Near full precision, high quality, memory-heavy
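The bit widths above translate almost directly into file size. A rough estimate for a 7B model, weights only; real GGUF files run a little larger because K-quant formats mix precisions and carry metadata, and the KV cache still needs room at runtime.

```python
# Back-of-the-envelope weight sizes for a 7B model at the bit widths in the
# table above. Treat these as lower bounds on the actual GGUF file size.
PARAMS_B = 7  # billions of parameters

for fmt, bits in [("Q2_K", 2), ("Q4_K", 4), ("Q5_K", 5), ("Q8_0", 8)]:
    gb = PARAMS_B * bits / 8  # params(B) * bits / 8 ≈ GB of weights
    print(f"{fmt}: ~{gb:.1f} GB of weights")
```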

9 of 16

Known Gotchas & Mitigations


Issue                 | Cause                                        | Mitigation
OOM errors            | Model too big for VRAM                       | Try lower quantization (Q4_K), use --numa flags, offload to CPU
Slow startup time     | Model loading & disk IO                      | Keep models on SSD, warm-start if possible
Low-quality responses | Over-quantized models (e.g., Q2)             | Stick to Q4/Q5 for coherent output
Context window limits | Small models have smaller token context      | Use models with 8K+ token support (Mistral, LLaMA 3)
Multi-user lag        | Thread contention or single-process engines  | Try vLLM or TGI for concurrent serving
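Two of these mitigations can be applied per request through Ollama's options field: num_gpu sets how many layers are offloaded to the GPU (useful when the full model OOMs in VRAM), and num_ctx raises the context window. The values below are illustrative guesses to tune for your card and model.

```python
# Minimal sketch: per-request Ollama options that address two gotchas above.
# num_gpu and num_ctx are real Ollama options; the specific numbers here are
# illustrative, not recommendations.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Summarise the trade-offs of quantization in two sentences.",
        "stream": False,
        "options": {
            "num_gpu": 24,    # offload only part of the model if VRAM is tight
            "num_ctx": 8192,  # larger context window (costs more memory)
        },
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```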

10 of 16


11 of 16

Prompt


12 of 16

Live Demo

13 of 16

Bottleneck: Hardware


14 of 16


15 of 16


16 of 16

Thank You!