OCR for simple printed text has been a solved problem since the 1990s. Tesseract handles clean, high-contrast typed documents reliably. But if you're processing real business paperwork—handwritten notarial deeds, century-old land registry scans, or contracts cobbled together from multiple sources—you're dealing with a completely different engineering challenge from scanning a modern invoice.
The Business Reality Check
In legal and notarial workflows across Italy, document types vary wildly in format and condition. Rogiti (notarial deeds) often combine handwritten passages with typewriter text, dense legal language, period-specific abbreviations, and formatting conventions that change by decade. Catastali (land registry documents) feature structured form fields, stamps, and handwritten annotations layered over scanned forms. The accuracy bar is unforgiving: extraction errors on financial or legal paperwork can have real consequences, and a single property transaction might generate dozens of these documents across multiple processing stages.
Tier 1: Tesseract for Modern Typed Documents
For clean, modern typed documents—contracts generated in the last twenty years, invoices from major vendors—Tesseract 5.x with proper preprocessing does the job adequately. The key is what happens before text hits the OCR engine. A preprocessing pipeline handling deskewing, denoising, and adaptive thresholding pushes accuracy from 80-85% up to 92-96% on typical business documents. Using OpenCV's rotation matrix calculations to straighten skewed scans, Gaussian blur for noise reduction, and Otsu thresholding for contrast normalization transforms garbage inputs into usable outputs.
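To make the thresholding step concrete, here is a minimal pure-Python sketch of Otsu's method, the contrast-normalization step in the pipeline above. Production code would use OpenCV directly (`cv2.threshold` with `cv2.THRESH_OTSU`, `cv2.getRotationMatrix2D` for deskewing, `cv2.GaussianBlur` for denoising); this version just shows what the algorithm is doing.

```python
def otsu_threshold(pixels):
    """Return the threshold that maximizes between-class variance
    for a flat list of 8-bit grayscale pixel values.
    (Sketch of what cv2.THRESH_OTSU computes internally.)"""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0.0
    for t in range(256):
        weight_bg += hist[t]          # pixels at or below candidate threshold t
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance: large when background and foreground separate cleanly.
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(pixels, threshold):
    """Map pixels to pure black/white around the computed threshold."""
    return [255 if p > threshold else 0 for p in pixels]
```

On a faded scan with a muddy gray background, this is the step that turns low-contrast ink into the crisp black-on-white input Tesseract expects.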
Tier 2: Vision Models for the Hard Stuff
Handwritten text and degraded historical documents break Tesseract completely. Character recognition on cursive Italian handwriting requires contextual understanding of letter patterns—not pixel-level template matching. The solution is vision-capable LLMs like Mistral's Pixtral-12B model, which processes document images directly with structured prompts specifying language context (Italian legal, including Latin phrases), common confusions to watch for (1/7, 0/6, 5/3 digit pairs), and output format requirements. On test sets of handwritten Italian documents, Pixtral achieves 85-90% accuracy—significantly better than any traditional OCR approach.
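A hypothetical sketch of the structured prompt that accompanies the document image. The wording and helper name are illustrative, not Mistral's API; the point is encoding the three things the text calls out—language context, known digit confusions, and output format—explicitly rather than hoping the model infers them.

```python
def build_ocr_prompt(doc_type="notarial deed"):
    # Illustrative prompt template; the exact phrasing would be tuned
    # against a held-out set of documents.
    return "\n".join([
        f"Transcribe this {doc_type} exactly as written.",
        "Language context: Italian legal text; Latin phrases may appear "
        "and must be preserved verbatim, not translated.",
        "Watch for commonly confused handwritten digits: 1/7, 0/6, 5/3. "
        "Prefer the reading consistent with surrounding amounts and dates.",
        "Output format: plain text, one paragraph per paragraph in the "
        "source; mark unreadable passages as [illegible].",
    ])
```

In practice a prompt like this ships with every image request, optionally specialized per document class (deed vs. registry form).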
The Fallback Chain Architecture
Production document processing needs a decision hierarchy that routes inputs intelligently. Modern typed documents route to Tesseract first (cheapest, fastest), with vision model fallback only when confidence drops below threshold. Handwritten or low-confidence documents go straight to Pixtral. Documents failing quality checks after multiple passes escalate to Gemini Vision as an emergency option—rarely triggered but ensuring no document is completely unprocessable. The routing logic enforces a maximum of three concurrent OCR jobs to avoid API rate limiting and manage costs on tiered pricing.
Structured Extraction: Two-Stage Is Better Than One
Raw text extraction isn't the end goal for business documents—structured data extraction is. A two-stage approach proves more reliable than trying to do everything in one pass: first, normalize text through the OCR pipeline; second, run a specialized LLM pass for field extraction targeting document-specific schemas (invoice_number, date, vendor_vat, line_items[] for invoices; property_cadastral_ref, parties[], notary_name, consideration_amount for land deeds). Splitting concerns this way lets each stage specialize—OCR normalizes text representation while the extraction layer applies business logic.
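A sketch of the second stage: the document-specific schemas (field names taken from the text) plus a validation pass over whatever the extraction LLM returns. The LLM call itself is stubbed as a plain callable; only the schema lookup and validation logic are shown, and the `_missing` bookkeeping is an illustrative convention.

```python
SCHEMAS = {
    "invoice": ["invoice_number", "date", "vendor_vat", "line_items"],
    "land_deed": ["property_cadastral_ref", "parties", "notary_name",
                  "consideration_amount"],
}

def extract_fields(normalized_text, doc_type, llm_call):
    """Stage two: ask the LLM for the schema's fields, then validate.

    llm_call(text, schema) -> dict is a stand-in for the real API call."""
    schema = SCHEMAS[doc_type]
    raw = llm_call(normalized_text, schema)
    # Keep only schema fields; flag anything the model failed to find
    # so it can be routed to human review instead of silently dropped.
    result = {field: raw.get(field) for field in schema}
    result["_missing"] = [f for f in schema if raw.get(f) is None]
    return result
```

Because stage one already normalized the text, the extraction prompt never has to compensate for OCR noise, which is what makes the split more reliable than a single combined pass.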
Production Lessons Nobody Tells You
Queue management matters enormously when vision model calls run $0.01+ per page and take seconds of latency. Implement priority queuing with configurable concurrency limits and aggressive caching for previously processed documents. Build a feedback loop capturing human corrections—when a reviewer fixes an extraction result, store that signal. Over time you accumulate training data for fine-tuning or prompt engineering improvements. And invest heavily in document classification upfront: routing a clean typed invoice through the vision model pipeline wastes money and latency, while pushing a handwritten deed through Tesseract produces unusable garbage.
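The caching and priority-queue ideas above can be sketched in a few lines: documents are keyed by a content hash so reprocessing an identical scan never triggers a second paid vision call, and the queue orders work by priority. The function names and priority convention are illustrative.

```python
import hashlib
import heapq

_cache = {}

def doc_key(doc_bytes):
    # Content hash: identical scans hit the cache regardless of filename.
    return hashlib.sha256(doc_bytes).hexdigest()

def process_with_cache(doc_bytes, ocr_fn):
    key = doc_key(doc_bytes)
    if key in _cache:
        return _cache[key]        # cache hit: zero API cost, zero latency
    result = ocr_fn(doc_bytes)
    _cache[key] = result
    return result

# Priority queue: lower number = processed sooner
# (e.g. interactive uploads before batch backfills).
queue = []

def enqueue(priority, doc_bytes):
    heapq.heappush(queue, (priority, doc_key(doc_bytes), doc_bytes))
```

A human-correction feedback loop would hang off the same content key: when a reviewer fixes a result, the corrected output is stored against the hash and becomes both a cache entry and a training example.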
Key Takeaways
- Preprocessing (deskewing, denoising, thresholding) boosts Tesseract accuracy from 80-85% to the low-to-mid 90s on modern documents
- Vision LLMs like Pixtral handle cursive handwriting with 85-90% accuracy—unreachable by traditional OCR
- Two-stage extraction (OCR → raw text, then LLM → structured data) outperforms single-pass approaches
- Document classification before processing saves significant cost and improves overall accuracy
- Build feedback loops to capture corrections and improve over time
The Bottom Line
The days of treating OCR as a commodity problem are over for anyone processing real business documents. If you're still running everything through Tesseract because it's free, you're leaving accuracy on the table—and probably building a brittle system that'll collapse when someone feeds it a handwritten rogito from 1974.