1 of 16

Running LLM Infra Locally: A Practical Guide

ChatGPT-like power in your living room!

Fifth Elephant 2025

2 of 16

Why self-host?

  • Control: run whichever model, whenever you want.
  • Privacy: data does not leave your machine.
  • Cost: you control the spend; often cheaper than frontier-model APIs.


3 of 16

Setup

  • Hardware
      • Radeon™ RX 7800 XT
      • 16GB VRAM
  • Software
      • Ollama
      • Tailscale

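With this setup, the model server can be reached like any other HTTP service on your tailnet. A minimal Python sketch that queries Ollama's /api/generate endpoint; the Tailscale hostname gpu-box and the mistral model name are placeholders for whatever you actually run.

```python
# Minimal sketch: query a local Ollama server's HTTP API.
# Assumptions: Ollama is running on its default port (11434), a model
# named "mistral" has been pulled, and "gpu-box" is a hypothetical
# Tailscale hostname for the machine with the GPU.
import requests

OLLAMA_URL = "http://gpu-box:11434/api/generate"  # or http://localhost:11434 on the box itself

def ask(prompt: str, model: str = "mistral") -> str:
    """Send one non-streaming generation request and return the text."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(ask("In one sentence, why self-host an LLM?"))
```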

4 of 16

Hardware


  • CPU-only (entry level):
    • 8+ core CPU (e.g., Ryzen 7, i7 12th Gen)
    • 32GB RAM minimum (models like Mistral-7B Q4 can still work)
    • SSD with 100–200 GB free (for model weights)
  • GPU-accelerated (recommended):
    • GPU with ≥12GB VRAM (e.g., RTX 3060, RX 6700 XT)
    • Models like LLaMA-3 8B Q4 or Mistral 7B Q4/Q5 run smoothly here
    • 64GB system RAM for better context support and multitasking

Rule of Thumb:

1B parameters ≈ 2GB memory (at 16-bit precision; quantization shrinks this, see the sketch below)

Constraint Note:

Quantized 7B models run on 8–12GB VRAM GPUs.

13B+ needs ≥24GB VRAM or offloading tricks (e.g., ggml+CPU fallback).
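A quick back-of-the-envelope check of these numbers. The sketch below counts weights only, so treat the results as lower bounds: KV cache and runtime buffers add a few GB on top, which is why the 13B FP16 figure pushes past 24GB cards.

```python
# Rough memory needed for model weights: params * bits-per-weight / 8.
# Weights only; KV cache, activations and runtime buffers come on top.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB."""
    return params_billion * bits_per_weight / 8  # 1e9 params cancels the /1e9 for GB

if __name__ == "__main__":
    print(f"7B  @ FP16: ~{weight_gb(7, 16):.1f} GB")   # the 1B ≈ 2GB rule of thumb
    print(f"7B  @ Q4  : ~{weight_gb(7, 4):.1f} GB")    # why quantized 7B fits 8–12GB VRAM
    print(f"13B @ Q4  : ~{weight_gb(13, 4):.1f} GB")
    print(f"13B @ FP16: ~{weight_gb(13, 16):.1f} GB")  # why 13B+ wants ≥24GB or offloading
```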

5 of 16

Inference Engines

Engine    | Backend        | Key Features
Ollama    | GGUF/llama.cpp | 1-command model serving, model fetching, basic API
LM Studio | llama.cpp      | GUI-based, ideal for testing & demos
vLLM      | CUDA           | Super fast, better batching, not quantization-friendly yet
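Because Ollama and vLLM both expose an OpenAI-compatible /v1 endpoint, the client code can stay the same while you swap engines underneath. A minimal sketch, assuming default ports (Ollama 11434, vLLM 8000) and a served model named mistral; adjust to whatever you actually run.

```python
# Minimal sketch: one client, two local OpenAI-compatible backends.
# Port numbers and the model name are assumptions based on default configs.
from openai import OpenAI

BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}

# Point the standard OpenAI client at the local engine instead of the cloud.
client = OpenAI(base_url=BACKENDS["ollama"], api_key="not-needed-locally")

reply = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Give one pro and one con of llama.cpp."}],
)
print(reply.choices[0].message.content)
```

Switching engines is then a one-line change of base_url, which makes it easy to benchmark Ollama against vLLM on the same prompts.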

6 of 16

Local LLM Tooling


7 of 16

Model Families


Model Family                     | Sizes             | Best For                           | Notes
Mistral                          | 7B, 12B, 22B, 24B | General purpose, chat, reasoning   | Fast, strong few-shot ability, open weights
Qwen (Alibaba)                   | 3B–72B            | Multilingual chat, reasoning       | Strong performance on benchmarks
LLaMA 2 / LLaMA 3                | 7B–70B            | Reasoning, dialogue, complex tasks |
Gemma 3                          | 1B–27B            | Vision                             | 128K context window, 140+ languages
Code LLMs (CodeLLaMA, StarCoder) | 7B–15B            | SQL generation, code tasks         |
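One way to compare families from this table is to pull a few tags and ask them the same question through the ollama Python client. A rough sketch; the tag names are assumptions, so check the registry (or `ollama list`) for the exact sizes and quantizations you want.

```python
# Rough sketch: try several model families on the same prompt.
# The tags "mistral", "llama3" and "gemma3" are assumed registry names;
# substitute whatever tags you actually have or want to download.
import ollama

QUESTION = "Write a one-line SQL query counting rows in a table named orders."

for tag in ["mistral", "llama3", "gemma3"]:
    ollama.pull(tag)  # downloads the model if it is not already local
    reply = ollama.chat(model=tag, messages=[{"role": "user", "content": QUESTION}])
    print(f"--- {tag} ---")
    print(reply["message"]["content"])
```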

8 of 16

Quantization


Format | Bits | Trade-offs
Q2_K   | 2    | Extreme compression, loses coherence
Q4_K   | 4    | Good balance, usable for 7B on 8GB VRAM
Q5_K   | 5    | Better output, needs ~10–12GB VRAM
Q8_0   | 8    | Near full precision, high quality, memory-heavy
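The bit widths above translate almost directly into file size. A rough estimate for a 7B model, weights only; real GGUF files run a little larger because K-quant formats mix precisions and carry metadata, and the KV cache still needs room at runtime.

```python
# Back-of-the-envelope weight sizes for a 7B model at the bit widths in the
# table above. Treat these as lower bounds on the actual GGUF file size.
PARAMS_B = 7  # billions of parameters

for fmt, bits in [("Q2_K", 2), ("Q4_K", 4), ("Q5_K", 5), ("Q8_0", 8)]:
    gb = PARAMS_B * bits / 8  # params(B) * bits / 8 ≈ GB of weights
    print(f"{fmt}: ~{gb:.1f} GB of weights")
```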

9 of 16

Known Gotchas & Mitigations


Issue                 | Cause                                        | Mitigation
OOM errors            | Model too big for VRAM                       | Try lower quantization (Q4_K), use --numa flags, offload to CPU
Slow startup time     | Model loading & disk IO                      | Keep models on SSD, warm-start if possible
Low-quality responses | Over-quantized models (e.g., Q2)             | Stick to Q4/Q5 for coherent output
Context window limits | Small models have smaller token context      | Use models with 8K+ token support (Mistral, LLaMA 3)
Multi-user lag        | Thread contention or single-process engines  | Try vLLM or TGI for concurrent serving
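Two of these mitigations can be applied per request through Ollama's options field: num_gpu sets how many layers are offloaded to the GPU (useful when the full model OOMs in VRAM), and num_ctx raises the context window. The values below are illustrative guesses to tune for your card and model.

```python
# Minimal sketch: per-request Ollama options that address two gotchas above.
# num_gpu and num_ctx are real Ollama options; the specific numbers here are
# illustrative, not recommendations.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Summarise the trade-offs of quantization in two sentences.",
        "stream": False,
        "options": {
            "num_gpu": 24,    # offload only part of the model if VRAM is tight
            "num_ctx": 8192,  # larger context window (costs more memory)
        },
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```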

10 of 16


11 of 16

Prompt


12 of 16

Live Demo

13 of 16

Bottleneck: Hardware


14 of 16


15 of 16


16 of 16

Thank You!