If you've got an active Claude subscription burning a hole in your pocket, there's now a way to put it to work across your existing tooling. CLI2API, posted on GitHub by developer zhusq20, wraps the locally-logged-in claude CLI as a drop-in OpenAI-compatible HTTP service running on localhost:8765. Point any OpenAI SDK at it (LangChain, LlamaIndex, open-webui, LobeChat) and you're off to the races with Claude under the hood.
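A minimal client sketch of what that looks like, assuming the server exposes the usual /v1 path and ignores the API key (the claude CLI's local login handles auth, so the dummy key is just protocol filler):

```python
from openai import OpenAI

# Point the official SDK at the local bridge instead of api.openai.com.
# The /v1 prefix and placeholder key are assumptions about the server's
# OpenAI-compatible surface; no real API key is involved.
client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="default",  # alias for your account's recommended model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)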

How It Works

The server takes incoming requests formatted as OpenAI API calls and translates them into claude CLI invocations. By default, it concatenates your message history into a User: ... / Assistant: ... block and pipes it through the CLI. Model selection happens via a flexible alias system: "default" uses your account's recommended model, "best" or "opus" gets you the strongest reasoning, "sonnet" handles daily coding tasks, and "haiku" keeps costs low for simple queries. You can also pin specific versions with full IDs like claude-sonnet-4-6, or tap extended context windows via sonnet[1m] and opus[1m] for long conversations and large document work.
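For illustration, here's how those aliases might look from the client side. The alias strings are the ones described above; the client setup mirrors the earlier sketch, and the probe prompt is purely illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-needed")

# Each request selects a model purely by alias string; the server maps
# it to a concrete Claude model (or a pinned version) behind the scenes.
for model in ("default", "best", "sonnet", "haiku",
              "claude-sonnet-4-6",  # pin an exact version
              "sonnet[1m]"):        # extended context window variant
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Reply with OK."}],
    )
    print(model, "->", resp.choices[0].message.content)
```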

Two Paths for Multi-Turn Conversations

CLI2API offers two distinct approaches to maintaining conversational context. The default stateless method mirrors the real OpenAI API exactly: the client sends the full message history with each request, giving you complete control over pruning, forking, or splicing in tool results. This works universally with every OpenAI-compatible framework, but it means resending everything on every call, which gets expensive fast and doesn't benefit from prompt cache hits, since the CLI sees one monolithic user message rather than a structured conversation.

The more interesting option is session-based mode: pass a UUID as session_id through extra_body, and the server invokes claude with --session-id or --resume depending on whether it's the first call in that thread. History lives inside the CLI process itself, only your latest user message travels over the wire, and the prompt prefix hits cache reliably across turns. The demo script prints cache_read_input_tokens and cache_creation_input_tokens per turn so you can watch those savings accumulate.
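A sketch of session mode in practice. The session_id key in extra_body comes from the project's description; reading the cache counters off the usage object is an assumption about the response shape, hence the defensive getattr:

```python
import uuid
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-needed")
session_id = str(uuid.uuid4())  # one UUID per conversation thread

for question in ("What's a monad?", "Now give a one-line example."):
    # Only the latest user message crosses the wire; the server resumes
    # the CLI-side history for this session_id on every later turn.
    resp = client.chat.completions.create(
        model="sonnet",
        messages=[{"role": "user", "content": question}],
        extra_body={"session_id": session_id},
    )
    print(resp.choices[0].message.content)
    # The demo script reports these counters per turn; surfacing them on
    # the usage object here is an assumption, so fall back gracefully.
    usage = getattr(resp, "usage", None)
    if usage is not None:
        print("cache_read:", getattr(usage, "cache_read_input_tokens", "n/a"))
```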

Concurrency, Retries, and Tuning Knobs

The server's tuning knobs for heavy workloads are plain environment variables that take effect on restart. CLAUDE_MAX_CONCURRENCY controls how many simultaneous claude subprocesses run (default: 4), while CLAUDE_TIMEOUT caps each call at 300 seconds before the process gets killed. On failure or timeout, the server retries up to five times (set via CLAUDE_MAX_RETRIES), with exponential backoff starting at two seconds and capping at sixty. You can also point it at a custom CLI binary via CLAUDE_BIN if you've got something non-standard in your PATH.
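One way to wire those knobs up when launching the server from Python. The variable names are the documented ones; the server.py entry point and the example values are assumptions:

```python
import os
import subprocess

# Tuning knobs are plain environment variables, read once at startup,
# so changes only take effect on a restart of the server process.
env = dict(
    os.environ,
    CLAUDE_MAX_CONCURRENCY="8",    # simultaneous claude subprocesses (default: 4)
    CLAUDE_TIMEOUT="300",          # seconds before a call's process is killed
    CLAUDE_MAX_RETRIES="5",        # retries with exponential backoff (2s, capped at 60s)
    CLAUDE_BIN="/opt/bin/claude",  # optional: a non-standard CLI binary
)
subprocess.run(["python", "server.py"], env=env)  # hypothetical entry point
```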

Known Trade-offs

No solution is without compromises. Temperature, top_p, max_tokens, and similar OpenAI knobs are silently ignored, since the underlying CLI doesn't expose them; the server accepts these fields purely for protocol compatibility. Streaming is pseudo-streaming: the server collects the full response first, then slices it into SSE chunks. You won't see tokens appear as they're generated, but it does mean a failed call can be retried cleanly before any chunks reach the client. There's also no authentication. The server binds to 127.0.0.1 only, and if you need network exposure, you'll want a reverse proxy with token auth in front.
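From the client's perspective, stream=True still works as ordinary SSE; the difference is purely in timing. A sketch against the same assumed local endpoint:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-needed")

# The wire format is standard streaming, but because the server buffers
# the full completion before slicing it into SSE chunks, these deltas
# arrive in a burst at the end rather than trickling in token by token.
stream = client.chat.completions.create(
    model="haiku",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```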

Key Takeaways

  • CLI2API bridges Claude subscriptions to the entire OpenAI SDK ecosystem without API key gymnastics
  • Session-based mode unlocks real prompt cache benefits for long-running conversations
  • Configurable concurrency and retry logic handles production workloads out of the box
  • Temperature and streaming limitations may rule it out for latency-sensitive applications

The Bottom Line

This is a clever hack that fills a real gap—developers who've already paid for Claude access but need OpenAI compatibility in their existing stacks now have a zero-friction bridge. Just don't mistake pseudo-streaming for the real thing or expect temperature controls to work. For batch processing, tooling integration, and conversational agents where you control the client side? This is exactly what the ecosystem was missing.