When Ahmet Ozel started building a document chunking and embedding API for production RAG systems, he expected the hard part to be choosing an embedding model. It wasn't. The unglamorous work of splitting documents turned out to be where retrieval quality actually moves, and those wins don't require a new API subscription or a GPU upgrade.

Sentence-Aware Chunking Beats Fixed-Size Every Time

The naive approach, Ozel explains, is to split text every N characters or tokens. Simple to implement, quietly catastrophic for retrieval. Fixed-size splitting cuts sentences in half and scatters related ideas across chunk boundaries. His fix: sentence-aware chunking with configurable overlap. Each chunk stays coherent and holds a complete thought, so the embedding model encodes a whole idea rather than a fragment, while the overlap carries context across chunk boundaries. Ozel notes this single change usually improves retrieval more than swapping out your embedding model entirely. Think about that before you spend compute budget on the latest transformer release: your chunk boundaries might be the real bottleneck.
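
To make the idea concrete, here is a minimal sketch of sentence-aware chunking with overlap. It is not Ozel's implementation: the function names, the max_chars and overlap parameters, and the regex-based sentence splitter are all assumptions chosen for illustration. A production pipeline would likely use a proper sentence segmenter and count tokens rather than characters.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?.
# Assumption: good enough for a sketch; abbreviations such as "e.g." will be
# split incorrectly, which a real sentence segmenter would handle.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def chunk_sentences(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Group whole sentences into chunks that aim to stay under max_chars.

    The last `overlap` sentences of each chunk are repeated at the start of
    the next one. Sentences are never cut, so a chunk can exceed max_chars
    when a single sentence (plus the overlap) is longer than the budget.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            # Close the current chunk and seed the next one with the overlap.
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_sentences("One. Two. Three. Four.", max_chars=10, overlap=1):
        print(repr(chunk))
```

Every chunk starts and ends on a sentence boundary, and with overlap=1 each chunk shares one sentence with its neighbor, so a query that matches an idea spanning two sentences can still land on a chunk containing both.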

Tables Are Their Own Problem Class

Real documents aren't just prose, and Ozel learned this the hard way. CSV and Excel files carry meaning in rows and columns, but a generic text splitter shreds records across chunk boundaries. A row representing a customer and their balance gets separated from its header context—useless when retrieved. His solution: treating tables as a distinct extraction path rather than flattening them into raw text first. Keeping rows intact preserves the semantic relationships that make tabular data meaningful in retrieval contexts. If you're indexing structured documents, this isn't optional—it's foundational.
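
One way to keep rows intact is sketched below using Python's standard csv module. This is an illustrative assumption, not the extraction path the API actually uses: the chunk_csv_rows name, the rows_per_chunk parameter, and the "column: value" rendering are all hypothetical. The point is only that each row is rendered with its header names and that chunk boundaries fall between rows, never inside one.

```python
import csv
import io


def chunk_csv_rows(csv_text: str, rows_per_chunk: int = 20) -> list[str]:
    """Render each CSV row as "column: value" pairs and group whole rows.

    Assumes the first line of csv_text is the header row. Every rendered row
    carries its own column names, so a retrieved chunk is readable without
    the rest of the table.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rendered = [
        "; ".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]
    # Chunk boundaries fall between rows, so no record is ever split.
    return [
        "\n".join(rendered[i:i + rows_per_chunk])
        for i in range(0, len(rendered), rows_per_chunk)
    ]


if __name__ == "__main__":
    sample = "customer,balance\nAyse,120.50\nMehmet,75.00\nElif,0.00\n"
    for chunk in chunk_csv_rows(sample, rows_per_chunk=2):
        print(chunk)
        print("---")
```

Repeating the column names in every row costs some tokens, but it means the customer and the balance stay attached to their labels wherever the chunk ends up.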

The Embedding Model Is a Tradeoff, Not a Default

The API supports nine embedding models with BAAI/bge-m3 running in production as the default choice. Why bge-m3? Strong multilingual performance makes it practical for international deployments. But Ozel is careful to frame model selection as an explicit tradeoff: quality versus dimension size (which directly affects your vector database storage costs) versus inference latency. There is no universally correct answer, which is why it's a parameter rather than something hardcoded in the API. The right embedding model depends on your data characteristics and budget constraints—factors that vary wildly between deployments. Building this as an abstraction was deliberate, not lazy engineering.
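
The storage side of that tradeoff is simple arithmetic, sketched here. The corpus size and the 384-dimension comparison model are hypothetical; 1024 is the dense output size commonly documented for bge-m3, but check the model card for whichever model you deploy. The figure covers raw float32 vectors only and ignores index overhead and metadata, which vary by vector database.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Size of the stored vectors alone (float32 by default).

    Excludes index structures and metadata, which depend on the database.
    """
    return num_vectors * dimensions * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    num_chunks = 1_000_000  # hypothetical corpus size
    for dims in (384, 1024):  # 384 stands in for a hypothetical smaller model
        size = raw_vector_bytes(num_chunks, dims)
        print(f"{dims} dims: {size:,} bytes ({to_gib(size):.2f} GiB)")
```

Storage grows linearly with dimension count, so moving from a 384-dimension model to a 1024-dimension one multiplies raw vector storage by about 2.7 for the same number of chunks.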

Multilingual Preprocessing Has Sharp Edges

The most surprising lesson from Ozel's production experience: for Turkish and other multilingual text, lowercasing before chunking measurably improved retrieval with bge-m3. But lowercasing is not the same operation in every language. Turkish treats dotted İ/i and dotless I/ı as separate letters, so a locale-unaware lowercase that maps I to i instead of ı can turn one word into another or corrupt it entirely. Locale-aware preprocessing mattered more than anticipated, and getting it wrong degraded results silently in ways that were difficult to catch without a dedicated evaluation set. This is the kind of bug that ships to production because no exception is thrown; users who query in Turkish just get subtly worse answers.
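
The dotted and dotless I problem is easy to reproduce. Python's str.lower() is locale-independent, so it applies the default Unicode mapping rather than the Turkish one. The lower_tr helper below is a hypothetical name and a minimal sketch of one workaround, not the preprocessing the API ships. It should only be applied to text already known to be Turkish, since it would also turn the English pronoun "I" into a dotless ı.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ, Latin capital letter I with dot above
DOTLESS_SMALL_I = "\u0131"    # ı, Latin small letter dotless i


def lower_tr(text: str) -> str:
    """Lowercase text using Turkish rules for the two I letters.

    Turkish maps İ -> i and I -> ı. Both capitals are replaced explicitly
    before the generic lower() handles every other character.
    """
    return (
        text.replace(DOTTED_CAPITAL_I, "i")
        .replace("I", DOTLESS_SMALL_I)
        .lower()
    )


if __name__ == "__main__":
    for word in ("ISPARTA", "\u0130STANBUL", "D\u0130YARBAKIR"):
        print(repr(word.lower()), "vs", repr(lower_tr(word)))
```

With the default mapping, "ISPARTA" becomes "isparta" with a dotted i, and "İSTANBUL" becomes an "i" followed by a combining dot (U+0307), which is a different character sequence from a plain "istanbul". Neither raises an error, which is why a dedicated evaluation set is the practical way to notice the difference.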

Treat It Like an API, Not a Script

The gap between a working notebook prototype and something your team can actually rely on comes down to infrastructure that nobody finds exciting: authentication, rate limiting, structured logging, and support for local backends (CPU/GPU/CUDA) alongside cloud deployments. None of this makes for compelling demo content, but it's what separates research experiments from production dependencies. Ozel's API runs wherever the workload needs to run, with the observability stack to debug issues when retrieval quality degrades in ways that aren't immediately obvious from error logs alone.

Key Takeaways

  • Fixed-size chunking silently destroys retrieval quality—sentence-aware splitting preserves semantic coherence
  • Tables require dedicated extraction logic; flattening them into prose loses row-level context
  • Embedding model choice involves tradeoffs between quality, dimensions (affecting storage costs), and latency
  • Locale-aware preprocessing matters for non-English text—Turkish lowercasing has specific gotchas with I characters
  • Production RAG infrastructure needs auth, rate limiting, structured logging, and flexible deployment targets

The Bottom Line

If your RAG system is returning weak answers, the chunking pipeline deserves scrutiny before you blame the language model. Sentence-aware splitting, table-aware extraction, and correct multilingual preprocessing are unglamorous changes with outsized impact—and none of them require a new API key to implement.