If you're running Claude Code with hardcoded MCP tool definitions, you're probably torching 10,000 or more tokens before your user finishes their first keystroke. A developer pattern called discovery-driven architecture changes that calculation by swapping static model lists for dynamic queries like nvidia_list_foundation_models, shaving about 40% off token consumption in the process.
The Problem With Static Schemas
Traditional MCP setup means dumping every possible model, endpoint, and configuration into your context up front. If you've got 50 tool definitions with full JSON schemas, you're looking at roughly 10-15k tokens (about 200-300 per definition) just for infrastructure. Because that context is sent with every request, the cost recurs throughout the session unless prompt caching absorbs it. It's the equivalent of pre-loading an entire phone book when you only need one number. Hardcoded prompts also break silently when providers rename endpoints or deprecate features, creating debugging nightmares that surface at the worst possible times.
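To sanity-check that estimate against your own setup, a rough sketch like the one below can help. It assumes the common rule of thumb of about four characters per token, which real tokenizers only approximate. The make_tool_definition helper and its schema fields are hypothetical stand-ins for your actual tool definitions.

```python
import json

CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string from its character length."""
    return len(text) // CHARS_PER_TOKEN


def make_tool_definition(index: int) -> dict:
    """Build a hypothetical tool definition with a small JSON schema."""
    return {
        "name": f"example_tool_{index}",
        "description": "Hypothetical tool used only to illustrate schema size. " * 4,
        "input_schema": {
            "type": "object",
            "properties": {
                "model": {"type": "string", "description": "Model identifier to call."},
                "prompt": {"type": "string", "description": "Input text for the model."},
                "max_tokens": {"type": "integer", "description": "Upper bound on output length."},
            },
            "required": ["model", "prompt"],
        },
    }


# Serialize 50 definitions the way they would appear in a request payload.
tools = [make_tool_definition(i) for i in range(50)]
per_tool = [estimate_tokens(json.dumps(t)) for t in tools]
total = sum(per_tool)
print(f"{len(tools)} tools, about {total} tokens ({total // len(tools)} per tool)")
```

Swapping in your real definitions gives a ballpark figure. For numbers you intend to bill against, use your provider's own token counting instead of the character heuristic.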
Discovery-Driven Architecture in Action
The fix isn't theoretical; it's shipping now. The NVIDIA API Catalog MCP implements discovery-driven patterns with tools like nvidia_list_foundation_models, which lets your agent query what's available at call time rather than trusting stale documentation. Instead of telling Claude Code 'you have access to Llama 3, Nemotron, and 47 other models,' you give it one tool to call: ask what's live. The agent receives a current list of accessible models and endpoints and bases subsequent calls on that list. No more tokens wasted on inactive models, and no more silent failures when infrastructure changes underneath you.
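The discovery-first flow can be sketched in a few lines. This is a minimal illustration, not the catalog's actual API: stub_list_foundation_models stands in for the real nvidia_list_foundation_models tool call, and the response fields ("id", "available") and the model identifiers are assumptions made up for the example.

```python
from typing import Callable, Optional


def stub_list_foundation_models() -> list[dict]:
    """Hypothetical stand-in for the nvidia_list_foundation_models tool.

    The real tool's response shape may differ; these fields are assumed.
    """
    return [
        {"id": "example-large", "available": False},
        {"id": "example-medium", "available": True},
        {"id": "example-small", "available": True},
    ]


def select_model(
    list_models: Callable[[], list[dict]],
    preferences: list[str],
) -> Optional[str]:
    """Return the first preferred model that discovery reports as live."""
    live = {m["id"] for m in list_models() if m.get("available")}
    for model_id in preferences:
        if model_id in live:
            return model_id
    return None  # nothing preferred is live; the caller decides what to do


# Ask what is live first, then pick from the preference order.
chosen = select_model(
    stub_list_foundation_models,
    ["example-large", "example-medium", "example-small"],
)
print(chosen)
```

The preference list lives in your own code or instructions, while availability comes from the discovery call, so a model going offline changes the selection without anyone editing a prompt.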
Proactive Quota Management
Beyond discovery, the NVIDIA Catalog MCP includes nvidia_check_token_quota—enabling agents to check their own constraints before firing off heavy inference tasks. If quota runs low mid-session, the agent can autonomously switch to a smaller model or pause and alert you. This moves governance from orchestrator code into the agent itself, which is exactly where it should live for production workloads.
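The fallback behavior described above might look something like this. It is a sketch under stated assumptions: stub_check_token_quota stands in for the real nvidia_check_token_quota tool, and the Quota shape, the cost figures, and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Assumed shape of a quota response; the real tool may return more."""
    remaining_tokens: int


def stub_check_token_quota() -> Quota:
    """Hypothetical stand-in for the nvidia_check_token_quota tool."""
    return Quota(remaining_tokens=12_000)


def plan_inference(quota: Quota, large_cost: int, small_cost: int) -> str:
    """Pick an action based on remaining quota before starting heavy work."""
    if quota.remaining_tokens >= large_cost:
        return "run_large_model"
    if quota.remaining_tokens >= small_cost:
        return "run_small_model"  # degrade gracefully to a cheaper model
    return "pause_and_alert"  # not enough quota for either option


# Check constraints first, then decide how to proceed.
action = plan_inference(stub_check_token_quota(), large_cost=20_000, small_cost=5_000)
print(action)
```

Keeping the decision in one small function makes the thresholds easy to audit, whether the agent applies the rule itself or your orchestrator does.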
Key Takeaways
- Hardcoded definitions for 50 tools or models consume tokens even when most of those models are inactive or unavailable to you
- Discovery-driven tools like nvidia_list_foundation_models query availability at runtime
- Token reduction starts at 40% on startup alone—before any real work begins
- Proactive quota checking with nvidia_check_token_quota prevents mid-run failures
Getting Started
Setup is straightforward. Install via CLI: 'claude mcp add --transport http nvidia-catalog https://vinkius.com/mcp/nvidia-api-catalog' or add to claude.json with your NVIDIA_API_KEY. Then update CLAUDE.md to replace static model lists with a discovery-first directive: call nvidia_list_foundation_models before any inference, then select the best available option. The MCP ecosystem has crossed 13,000 servers, and X launched its official MCP server in April 2025, a sign that dynamic endpoints are becoming table stakes for production agents.
The Bottom Line
Static MCPs belong in demos. For anything shipping to production, discovery-driven architecture isn't optional—it's the only sane approach when you're paying per token. Hardcoding is a liability that compounds as your infrastructure evolves. Give your agent the tools to ask what's actually there, and let it figure out the landscape at runtime.