From Weights to Production: The Real Work Behind LLM Deployment on Cloud Infrastructure

When teams at tech companies talk about "deploying an LLM," the conversation usually starts with excitement: picking a model, downloading weights from Hugging Face, maybe spinning up a quick demo in Colab. What gets glossed over is everything that happens between that first inference call and a production endpoint that doesn't fall over when traffic spikes at 9 AM on Monday. A deep dive posted this week on DEV.to breaks down the operational reality of running large language models at scale, and it's required reading for anyone building with AI in 2026.

The Self-Hosting Path: Maximum Control, Maximum Headache

Provisioning dedicated GPU instances from AWS, Google Cloud, or Azure gives you raw hardware access. Teams typically start with a g6e or p5 instance on AWS (A3 VMs on GCP, NC-series on Azure) and layer in an open inference engine like vLLM, TGI, or TensorRT-LLM. The article walks through a minimal vLLM deployment — one command to spin up the container, specify tensor-parallel size across eight GPUs, set max context length, and you're live. Sounds simple. It isn't. You now own OS patching, NVIDIA driver upgrades, spot instance interruption handling, and scaling logic that doesn't explode during long-context workloads where memory pressure spikes unpredictably.
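To make the "one command" concrete, here is a minimal sketch of what such a launch might look like. The article's exact command isn't reproduced here; this assumes the `vllm serve` CLI is installed on the instance, and the model ID, context length, and port are illustrative values rather than recommendations.

```python
import shlex


def build_vllm_command(
    model: str,
    tensor_parallel_size: int = 8,
    max_model_len: int = 32768,
    port: int = 8000,
) -> list[str]:
    """Assemble an argv list for launching a vLLM OpenAI-compatible server.

    The defaults are assumptions for illustration: eight-way tensor
    parallelism (one shard per GPU) and a 32K-token context cap.
    """
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    if max_model_len < 1:
        raise ValueError("max_model_len must be at least 1")
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--port", str(port),
    ]


# Example model ID chosen for illustration only.
command = build_vllm_command("meta-llama/Llama-3.3-70B-Instruct")
print(shlex.join(command))
```

The command itself is the easy part. Everything listed above (patching, drivers, interruption handling, scaling) lives outside it.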

Kubernetes Gets Messy Fast

For teams running multiple models or needing replication, Kubernetes with GPU operators becomes the next evolution. KServe InferenceService CRDs or Ray Serve clusters handle model composition across multi-node pipelines — powerful stuff on paper. In practice, GPU health checks, pod preemption, and HPA thresholds that measure GPU utilization instead of request queue depth create new failure modes every week. Cold starts remain a problem if you scale to zero, but keeping replicas warm during off-peak hours burns budget nobody wants to explain to finance.
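The difference between scaling signals is easier to see with numbers. The sketch below applies the replica formula from the Kubernetes HPA documentation, desired = ceil(current × currentMetric / targetMetric), to two different metrics. The metric values, targets, and replica bounds are made up for illustration, and the tolerance handling is simplified.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 8,
    tolerance: float = 0.1,
) -> int:
    """Simplified version of the HPA replica calculation.

    current_metric is the per-replica average of whatever signal is used
    (GPU utilization percent, queued requests, and so on).
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band, the autoscaler leaves the count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical overload: two replicas, GPUs near saturation, queues growing.
by_gpu_util = desired_replicas(2, current_metric=95.0, target_metric=80.0)
by_queue_depth = desired_replicas(2, current_metric=12.0, target_metric=4.0)
print(by_gpu_util)     # 3
print(by_queue_depth)  # 6
```

GPU utilization tops out at 100%, so once the hardware is saturated it says little about how far behind the service is. Queue depth keeps growing with the backlog, which is why it tends to be the more useful signal for inference workloads.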

Managed APIs Flip the Operational Model

The article makes a compelling case for managed inference services like Oxlo.ai as an alternative that removes the infrastructure layer entirely. The key differentiator is pricing structure: instead of metering tokens (where long-context and agentic workloads with large input prompts can balloon costs unpredictably), Oxlo.ai charges a flat rate per API request regardless of prompt length. For teams building retrieval-augmented generation pipelines, multi-turn agents, or anything sending large retrieved contexts to the model, this pricing model eliminates a variable that has sunk plenty of AI product budgets. Oxlo.ai hosts over 45 open-source and proprietary models across seven categories — general-purpose LLMs like Llama 3.3 70B and Qwen 3 32B, reasoning models including DeepSeek R1 671B MoE and Kimi K2.6, plus specialized endpoints for code, vision, audio, and embeddings. Crucially, it's fully OpenAI SDK-compatible, meaning migration from another provider requires only changing the base URL. No GPU drivers to configure, no cold starts on popular models, consistent latency from the first request of the day.
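A quick back-of-the-envelope comparison shows why the pricing structure matters for large prompts. Every price in this sketch is invented for illustration; none of the figures are Oxlo.ai's or any other provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Metered pricing: separate USD rates per million input and output tokens."""
    input_usd_per_million: float
    output_usd_per_million: float

    def cost(self, input_tokens: int, output_tokens: int, requests: int = 1) -> float:
        per_request = (
            input_tokens * self.input_usd_per_million
            + output_tokens * self.output_usd_per_million
        ) / 1_000_000
        return per_request * requests


@dataclass(frozen=True)
class RequestPricing:
    """Flat pricing: a fixed USD charge per request, regardless of prompt size."""
    usd_per_request: float

    def cost(self, input_tokens: int, output_tokens: int, requests: int = 1) -> float:
        return self.usd_per_request * requests


def break_even_input_tokens(
    metered: TokenPricing, flat: RequestPricing, output_tokens: int
) -> float:
    """Input tokens per request above which the flat rate is cheaper."""
    flat_budget = flat.usd_per_request * 1_000_000
    return (flat_budget - output_tokens * metered.output_usd_per_million) / (
        metered.input_usd_per_million
    )


# Hypothetical rates, chosen only to make the arithmetic easy to follow.
metered = TokenPricing(input_usd_per_million=0.50, output_usd_per_million=1.50)
flat = RequestPricing(usd_per_request=0.01)

threshold = break_even_input_tokens(metered, flat, output_tokens=500)
rag_metered = metered.cost(input_tokens=60_000, output_tokens=500, requests=10_000)
rag_flat = flat.cost(input_tokens=60_000, output_tokens=500, requests=10_000)
print(f"break-even at about {threshold:.0f} input tokens per request")
print(f"metered: ${rag_metered:.2f}  flat: ${rag_flat:.2f}")
```

The same arithmetic cuts the other way for short prompts: below the break-even point, metered pricing is the cheaper option, so the comparison is worth running against your own traffic profile.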

Hybrid Architectures Are Winning in Practice

The article's most useful framing is the hybrid deployment pattern: small, latency-sensitive models self-hosted at the edge or on private cloud for tasks where round-trip latency matters, while large reasoning tasks, long-context summarization, and image generation get offloaded to managed APIs. A RAG pipeline might run a local embedding model for vector search, then call Oxlo.ai for final generation with a 128K context window — without watching token counters tick up on every retrieved document chunk.
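One way to picture the hybrid pattern is as a small routing function sitting in front of two OpenAI-compatible endpoints. This is a sketch under stated assumptions, not the article's implementation: the URLs, model names, task labels, and the 4,096-token threshold are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    base_url: str  # value you would pass as base_url to an OpenAI-compatible client
    model: str


# Hypothetical endpoints: a self-hosted server and a managed API.
LOCAL = Endpoint("local", "http://localhost:8000/v1", "local-small-model")
MANAGED = Endpoint(
    "managed", "https://managed-inference.example.com/v1", "large-reasoning-model"
)

# Assumed policy: short, latency-sensitive tasks stay local.
LOCAL_TASKS = frozenset({"embedding", "classification", "autocomplete"})
LOCAL_MAX_PROMPT_TOKENS = 4096


def choose_endpoint(task: str, prompt_tokens: int) -> Endpoint:
    """Route a request to the local or managed endpoint.

    A task runs locally only if it is on the latency-sensitive list and
    its prompt fits the local model's assumed context budget.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if task in LOCAL_TASKS and prompt_tokens <= LOCAL_MAX_PROMPT_TOKENS:
        return LOCAL
    return MANAGED


# A RAG request: embed the query locally, then generate remotely.
embed_target = choose_endpoint("embedding", prompt_tokens=64)
generate_target = choose_endpoint("generation", prompt_tokens=90_000)
print(embed_target.name, generate_target.name)  # local managed
```

Because both sides speak the same API shape, the calling code only needs the chosen `base_url` and `model`; the routing policy can change without touching the request logic.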

Key Takeaways

Self-hosting on cloud VMs gives you control over quantization, speculative decoding, and custom scheduling, but you're on the hook for the entire infrastructure lifecycle.
Kubernetes orchestration adds replication and rolling updates at the cost of new failure modes around GPU health checks and scaling thresholds.
Request-based managed APIs like Oxlo.ai remove pricing unpredictability from long-context and agentic workloads where input tokens dominate.
Hybrid deployment (local models for latency-sensitive tasks, managed APIs for heavy lifting) is the pattern emerging as production best practice.

The Bottom Line

The gap between "model works in a notebook" and "production inference endpoint that doesn't embarrass you on launch day" is wider than most engineering roadmaps account for. Choosing a managed inference service isn't surrendering control. It's an honest trade: less time managing GPU clusters, more time on prompt engineering and product work where the actual value lives.
