Virtual assistants have always been only as smart as their natural language understanding layer—and for years, that meant wrestling with brittle pipelines of intent classifiers, slot extractors, and dialog managers stitched together with hand-coded rules. The problem? Users don't talk in predictable patterns. They use pronouns, skip information, change subjects mid-sentence, and expect the assistant to just figure it out. Rigid NLU systems collapse under that ambiguity. A new approach is taking over: treating large language models as a single unified engine for intent classification, slot filling, entity resolution, and dialog state tracking—all in one forward pass.

The Four Things Your Assistant Actually Needs

Production assistants demand four capabilities from their language layer: identifying what the user wants (intent), pulling out specific values like dates or locations (slots), resolving vague references into concrete data (entity resolution), and keeping track of where you are in a conversation (dialog state). Instead of maintaining separate models for each task, a capable LLM handles all four when prompted correctly. Model selection depends on your latency budget and complexity requirements. Llama 3.3 70B hits the sweet spot between speed and accuracy for fast intent classification. Qwen 3 32B is built for multilingual workloads requiring cross-language reasoning. When users throw genuinely ambiguous queries that need deep chain-of-thought disambiguation, DeepSeek R1 671B MoE or Kimi K2.6 deliver without external orchestration.

Prompt Engineering for Structured Intent and Slots

The implementation pattern is straightforward: a static system prompt defining available intents, slot types, and output rules, followed by the raw user utterance. Keep that system prompt frozen so you can cache it aggressively across requests. The model responds with structured JSON matching your schema—no fine-tuning required. Oxlo.ai supports OpenAI SDK-compatible endpoints, meaning you can drop this into existing assistant backends without touching your HTTP client. For stricter output guarantees, combine JSON mode with a partial skeleton in the system prompt that explicitly lists required fields. Models like DeepSeek V3.2 and Minimax M2.5 follow coding instructions particularly well, which translates to tighter schema adherence during slot extraction.

Handling Multi-Turn Conversations Without Going Broke

Conversational assistants accumulate history fast. A user says "Book a flight to Seattle," then follows up with "Make it refundable"—referencing the previous intent through coreference. The naive solution is passing full conversation history in every request, but on token-based providers, that inflates your bill linearly with each turn. Here's where Oxlo.ai's request-based pricing flips the economics: one flat cost per API call regardless of context length. Your system prompt could be 2,000 tokens and your conversation window could stretch to 50 messages—the price stays identical. No summarization layers required just to keep costs manageable. This fundamentally changes how you architect dialog state tracking.

Tool Calling for Entity Resolution

Some slots can't come from text alone. When a user says "Reorder my usual," the model must call an external profile API to resolve that vague reference into a concrete product ID. Oxlo.ai supports function calling on its chat endpoints, letting models decide when to invoke tools and return structured results. The pattern: define your tool schema (get_user_last_order, lookup_product, fetch_user_preferences), send it with the request, let the model choose whether to call, execute the tool, append the result, and loop back to the model for final intent resolution. This keeps prompts short and moves business logic outside the prompt layer where it belongs.

Multimodal Inputs: Speech and Vision

Modern assistants aren't text-only anymore. Oxlo.ai offers Whisper Large v3, Whisper Turbo, and Whisper Medium for audio transcription, plus Kokoro 82M for text-to-speech synthesis. Transcribe user speech first, feed the result into your NLU pipeline—no fundamental architecture change required. For vision-enabled assistants, models like Kimi VL A3B or Gemma 3 27B can parse screenshots, photos, or receipts and extract line items as structured slots. Imagine a user photographing a receipt and saying "Add these expenses to my report"—the vision model pulls the data, the NLU layer categorizes it.

Why Request-Based Pricing Changes Assistant Economics

Token-based providers charge per token, which means your bill grows with every conversation turn, every added example in your system prompt, every expanded slot definition. For assistants—an inherently long-context workload—this is a structural disadvantage. Oxlo.ai charges one flat rate per request. If you're currently on Together AI, Fireworks AI, OpenRouter, Replicate, or Anyscale and paying per token, switching to request-based pricing can cut costs dramatically as conversation depth increases. The free tier includes 60 requests daily across 16+ models with a 7-day full-access trial, so you can validate long-context behavior before committing.

Key Takeaways

  • Replace rigid NLU pipelines with a single LLM that handles intent, slots, entities, and state in one pass
  • Use request-based pricing platforms for conversational workloads where token costs scale with history
  • Implement function calling to externalize business logic—keep prompts short and focused on language understanding
  • Start testing with Llama 3.3 70B or Qwen 3 32B, scale to reasoning models like Kimi K2.6 as complexity grows

The Bottom Line

Traditional NLU pipelines were a workaround for models that couldn't handle ambiguity. We have better tools now—stop engineering around their limitations. Request-based pricing combined with unified LLM inference means building genuinely conversational assistants is finally cheaper than maintaining the old dialog-tree approach.