The Gap Between Loads and Works

Running Claude Code against a local model sounds straightforward until it times out before producing anything useful. That's exactly what happened on an M3 Pro when someone tried to investigate a Kubernetes incident with qwen3.6 running locally—no data leaving the machine, no API key required, just silence from the tool calls. The fix wasn't a better prompt or a different approach. It was four software changes that transformed a stalled session into one that went from investigation to an open pull request without anything crossing the network.

Why Local Matters (And Who It's For)

The use case is narrow but real: regulated environments and air-gapped clusters where data literally cannot cross the firewall. In those situations, local isn't preference—it's requirement. The trade-off is latency and model size versus privacy and flat cost. Frontier models stay remote; smaller MoE models like qwen3.6 (35B parameters, ~3B active per token) fit locally while approaching 35B quality at 14B-level resource costs. A dense 35B doesn't even fit in 36 GiB of unified memory—the MoE architecture is what makes this possible.

The Four Fixes That Unlock the Setup

The first issue hits reasoning models hard: qwen3.6's thinking chain consumed Claude Code's entire timeout budget before emitting a single tool call. Setting MAX_THINKING_TOKENS=0 fixes this—a control test showed 128 thinking tokens and 6.7 seconds with thinking enabled versus 1 token and 0.6 seconds disabled. Second, skip Ollama 0.20 entirely; version 0.24.0 is mandatory because it fixes MLX safetensor model creation and routes the think parameter correctly through the OpenAI-compatible API. Third, Modelfile templates don't work for thinking control on the MLX runner—it uses its own renderer (qwen3.5) that overrides custom templates. Control thinking via the API with think:false or MAX_THINKING_TOKENS=0 instead. Fourth, ignore the 404 storm—Claude Code probes Anthropic-native endpoints Ollama doesn't handle, but these failures are fast and harmless.

The Stack That Makes It Run

Hardware: Apple M3 Pro with 18 GPU cores and 36 GiB unified memory (~150 GB/s bandwidth). Model: qwen3.6:35b-a3b-coding-nvfp4 at 21 GB on disk, ~20 GiB resident once loaded. Runtime: Ollama 0.24.0 using the MLX backend (Apple Silicon-native path, not llama.cpp/Metal). Client: Claude Code v2.1.84 pointed at localhost:11434 with no ANTHROPIC_API_KEY set—it's that absence of a key that forces local mode instead of cloud contact.

Where Hardware Sets the Ceiling

With the fixes applied, it works—but performance is prefill-bound on this hardware. Every turn re-reads Claude Code's system prompt, tool definitions, CLAUDE.md, and conversation history before generating tokens. On the M3 Pro, that's 60–70 seconds for a 25K-token input with 90%+ of request time spent in prefill versus generation. The completed investigation-to-PR session took 34 minutes: roughly 20 in prefill, 8 in generation, 6 in tool execution. Slow, but it finished correctly. Prefill rate scales directly with memory bandwidth—M3 Max/Ultra or newer silicon raises that ceiling because the bottleneck is bandwidth, not compute. Ollama's own MLX benchmarks on M5-class hardware show sharp throughput gains over earlier chips thanks to dedicated matrix-multiply units.

The 32K Window Constraint

Context window size depends on available GPU memory after model loading. On a 36 GiB machine with Metal seeing only ~78% of total (28.1 GiB), the default window is 32,000 tokens—each token represents text the model remembers in context. Exceed that and KV cache thrashing kicks in hard: eviction bursts of 600 MiB at a time, cache hit rates collapsing to ~30%, prefill stretching to 2–3 minutes per turn, and OS memory pressure hitting 93% utilization. The practical rule on 36 GiB is keep sessions scoped. At 48 GiB, same window, no strain. At 64 GiB or more, the context window opens to 256K and that constraint dissolves entirely—though Apple's unified memory architecture means Metal only sees a fraction of total RAM regardless.

Key Takeaways

  • Reasoning models like qwen3.6 require MAX_THINKING_TOKENS=0 or they consume your entire timeout on thinking alone
  • Ollama 0.24.0 is mandatory—the MLX path has breaking differences from llama.cpp-era tooling that persist regardless of hardware
  • Modelfile templates don't transfer to MLX; control parameters via environment variables and API calls instead, not templates
  • Performance is memory-bandwidth-bound—prefill dominates on Apple Silicon at this model size
  • 36 GiB works if you keep sessions scoped inside 32K tokens; 48+ GiB removes the constraint; 64 GiB opens 256K context

The Bottom Line

This isn't a fringe setup anymore—it closed the investigation-to-PR loop entirely offline, which means air-gapped AI-assisted development is production-viable for teams with compliance constraints. If you're running Apple Silicon with adequate unified memory, Qwen3.6 via Ollama's MLX backend gives you a working Claude Code that never phones home—and that's worth trying if your stack has ever been gated by perimeter policy.