Take Back Your AI: Self-Hosting LLMs Is Now a Docker Compose Away

Cloud AI is convenient until you see the bill. API costs pile up silently, rate limits interrupt work mid-flow, and every prompt lives forever on someone else's servers — a data sovereignty nightmare for anyone paying attention. The solution has been lurking in plain sight: Docker plus Ollama plus Open WebUI gives you a complete private AI stack that spins up in minutes and runs on hardware you probably already own. The architecture is brutally simple by design. Ollama handles model downloads and inference via its REST API on port 11434, wrapping the battle-tested llama.cpp engine for efficient CPU and GPU execution. Open WebUI provides the ChatGPT-style interface that connects to your local Ollama instance — no cloud dependency whatsoever. A single docker-compose.yml ties everything together with persistent volumes so your models survive container restarts. The author has benchmarked this stack on hardware ranging from a decade-old Xeon server to a $150 BMAX Pro mini PC with 24 GB RAM. Hardware requirements are refreshingly modest. Models like llama3.2:3b (roughly 2 GB quantized) need only 4–6 GB of RAM and run at acceptable speeds even on integrated graphics. The article's model selection table shows the full range — from quick-answer models like gemma3:4b to heavy hitters like qwen3:30b requiring 24–32 GB for complex reasoning tasks. Key insight: LLM inference is memory-bandwidth-bound, not compute-bound, meaning a used server with 128 GB of DDR3 can serve 26B models that rival cloud offerings — all on CPU.

GPU Acceleration Options

Performance jumps significantly if you have a discrete GPU available. Nvidia users install the nvidia-container-toolkit and uncomment the deploy block in the compose file to enable CUDA passthrough — expect token generation to climb from ~10 tok/s on CPU to 50+ tok/s with GPU acceleration. AMD GPU owners use the ollama/ollama:rocm image with ROCm drivers, though support is hardware-dependent. Intel integrated graphics users get Vulkan support out of the box with Mesa drivers installed on the host.

API Surface and Integrations

This is where local hosting becomes strategic. Ollama exposes both a native REST API for full control and OpenAI-compatible endpoints at /v1 — meaning VS Code extensions, LangChain, n8n workflows, Hermes Agent, and any tool configured for OpenAI works by changing one base_url to localhost:11434 with no API key required. Structured JSON output via schema definitions lets you force models into predictable shapes for automation pipelines, a feature the author highlights as essential for teaching workflows like generating mark schemes and grading code. Security considerations are critical. By default Ollama binds to 127.0.0.1 only — localhost access prevents network exposure. The article explicitly warns against setting OLLAMA_HOST=0.0.0.0 since there's no built-in authentication on the local API. Remote access requires a reverse proxy with auth (Nginx basic auth, Cloudflare Tunnel with Access policies) or VPN solutions like Tailscale to keep the instance off the public internet.

The Bottom Line

This stack represents a fundamental shift in AI accessibility — running capable models at home is no longer a GPU enthusiast's hobby but an operational reality for anyone comfortable with Docker. The privacy and cost benefits compound over time, and the OpenAI-compatible API surface means switching away from cloud providers doesn't require rewiring your tooling. For developers and power users watching where their data flows, this three-piece stack is the infrastructure equivalent of self-hosting your email instead of surrendering it to Big Tech.

> Take Back Your AI: Self-Hosting LLMs Is Now a Docker Compose Away

GPU Acceleration Options

API Surface and Integrations

The Bottom Line

> RELATED DISPATCHES