Running LLM Infra Locally: A Practical Guide
ChatGPT-like power in your living room!
Fifth Elephant 2025
Why self-host?
1. Control: run whichever model you want, whenever you want.
2. Privacy: your data never leaves your machine.
3. Cost: you control the spend; at steady usage, often cheaper than frontier-model APIs.
Setup
Hardware
Rule of Thumb:
1B parameters ≈ 2GB memory (at 16-bit precision)
Constraint Note:
Quantized 7B models run on 8–12GB VRAM GPUs.
13B+ needs ≥24GB VRAM or offloading tricks (e.g., llama.cpp partial CPU offload).
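To make the rule of thumb concrete, here is a minimal Python sketch of the memory estimate; the 1.2 overhead factor for KV cache and activations is an assumption, not a measured constant.

```python
def estimate_memory_gb(params_billion: float, bits: int = 16, overhead: float = 1.2) -> float:
    """Rough memory footprint: parameters * bytes-per-parameter * overhead."""
    return params_billion * (bits / 8) * overhead

for bits in (16, 8, 4):
    print(f"7B model @ {bits}-bit: ~{estimate_memory_gb(7, bits):.1f} GB")
# ~16.8 GB at 16-bit, ~8.4 GB at 8-bit, ~4.2 GB at 4-bit
```

At 16-bit, each parameter takes 2 bytes, which is where the "1B parameters ≈ 2GB" rule comes from; quantization divides that roughly by the bit-width reduction.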
Inference Engines
| Engine | Backend | Key Features |
|---|---|---|
| Ollama | gguf/llama.cpp | 1-command model serving, model fetching, basic API |
| LM Studio | llama.cpp | GUI-based, ideal for testing & demos |
| vLLM | CUDA | Super fast, better batching, not quantization-friendly yet |
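If Ollama is your engine, here is a minimal sketch of querying its local REST API from Python; it assumes `ollama serve` is running and the `mistral` model has already been pulled (11434 is Ollama's default port).

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "mistral",                 # assumes `ollama pull mistral` was run
        "prompt": "Explain quantization in one sentence.",
        "stream": False,                    # return one JSON object instead of a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```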
Local LLM Tooling
Model Families
| Model Family | Sizes | Best For | Notes |
|---|---|---|---|
| Mistral | 7B, 12B, 22B, 24B | General purpose, chat, reasoning | Fast, strong few-shot ability, open weights |
| Qwen (Alibaba) | 3B–72B | Multilingual chat, reasoning | Strong performance on benchmarks |
| LLaMA 2 / LLaMA 3 | 7B–70B | Reasoning, dialogue, complex tasks | |
| Gemma 3 | 1B–27B | Vision + text | 128K context window, 140+ languages |
| Code LLMs (CodeLLaMA, StarCoder) | 7B–15B | SQL generation, code tasks | |
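One way to route tasks to these families, sketched with the `ollama` Python client; the task-to-model mapping and the specific tags are illustrative assumptions, and each tag must be pulled before use.

```python
import ollama  # pip install ollama

# Illustrative mapping based on the table above; tags are assumptions
# and must be fetched first, e.g. `ollama pull codellama:7b`.
TASK_MODELS = {
    "chat": "mistral:7b",
    "multilingual": "qwen2.5:7b",
    "vision": "gemma3:4b",
    "code": "codellama:7b",
}

reply = ollama.chat(
    model=TASK_MODELS["code"],
    messages=[{"role": "user", "content": "Write a SQL query for total sales per region."}],
)
print(reply["message"]["content"])
```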
Quantization
| Format | Bits | Trade-offs |
|---|---|---|
| Q2_K | 2 | Extreme compression, loses coherence |
| Q4_K | 4 | Good balance, usable for 7B on 8GB VRAM |
| Q5_K | 5 | Better output, needs ~10–12GB VRAM |
| Q8_0 | 8 | Near full precision, high quality, memory-heavy |
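A sketch of loading a Q4_K-quantized GGUF with llama-cpp-python; the model path is a placeholder, and `n_gpu_layers` may need lowering on 8GB cards.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU; lower this on small cards
)
out = llm("Q: What does Q4_K quantization trade away? A:", max_tokens=64)
print(out["choices"][0]["text"])
```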
Known Gotchas & Mitigations
| Issue | Cause | Mitigation |
|---|---|---|
| OOM errors | Model too big for VRAM | Try lower quantization (Q4_K), use --numa flags, offload layers to CPU |
| Slow startup time | Model loading & disk I/O | Keep models on SSD, warm-start if possible |
| Low-quality responses | Over-quantized models (e.g., Q2) | Stick to Q4/Q5 for coherent output |
| Context window limits | Small models have smaller token contexts | Use models with 8K+ token support (Mistral, LLaMA 3) |
| Multi-user lag | Thread contention or single-process engines | Try vLLM or TGI for concurrent serving |
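For the multi-user case, a minimal vLLM batching sketch; the Hugging Face model id is an assumption, and a CUDA GPU is required.

```python
from vllm import LLM, SamplingParams  # pip install vllm

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # assumed HF model id
params = SamplingParams(temperature=0.7, max_tokens=64)

# vLLM's continuous batching serves many prompts in a single call,
# avoiding the per-request serialization that causes multi-user lag.
prompts = [f"User {i}: summarize local LLM serving in one line." for i in range(8)]
outputs = llm.generate(prompts, params)
for o in outputs:
    print(o.outputs[0].text.strip())
```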
Prompt
Live Demo
Bottleneck: Hardware
Thank You!