Most projects combining speech-to-text with generative AI stop at "transcribe, hand to LLM, done." That's fine for demos, but it throws away the best part of services like AmiVoice—which returns timestamps for every single word. An ex-Java engineer building in public decided to actually use that data.
The Timestamp Exploitation Angle
The app, called Reading Speed Meter, lets users read Japanese passages aloud (up to 10 seconds), then computes two metrics: pure speaking speed in characters per minute and a "stagnation rate" measuring the proportion of time spent in pauses. Both derive from AmiVoice's per-word starttime/endtime fields, not just the transcription text. The developer, posting on DEV.to with full transparency about their AI-collaborative workflow, calls this approach a "two-stage design": code handles the math (arithmetic is cheap and precise), and Claude Haiku generates only the wording of the coaching feedback.
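As a rough illustration of the code-side stage, the two metrics could be derived from word timestamps along these lines. This is a minimal sketch, not the developer's actual code: the Token shape assumes AmiVoice-style written, starttime, and endtime fields with times in milliseconds, and the formulas (characters divided by voiced time, pause time divided by total span) are one plausible reading of "pure speaking speed" and "stagnation rate."

```typescript
// Hypothetical token shape. Field names follow the starttime/endtime fields
// mentioned above; millisecond units are an assumption.
interface Token {
  written: string;   // recognized text for this word
  starttime: number; // ms from start of audio (assumed)
  endtime: number;   // ms from start of audio (assumed)
}

interface ReadingMetrics {
  charsPerMinute: number; // characters per minute of voiced time only
  stagnationRate: number; // share of the utterance spent in pauses, 0..1
}

// Assumes tokens are in chronological order and do not overlap.
function computeMetrics(tokens: Token[]): ReadingMetrics {
  if (tokens.length === 0) {
    return { charsPerMinute: 0, stagnationRate: 0 };
  }
  let chars = 0;
  let voicedMs = 0;
  for (const t of tokens) {
    chars += Array.from(t.written).length; // count code points, not UTF-16 units
    voicedMs += Math.max(0, t.endtime - t.starttime);
  }
  // Total span from the first word's start to the last word's end.
  const spanMs = tokens[tokens.length - 1].endtime - tokens[0].starttime;
  // Whatever part of the span is not covered by a word counts as pause.
  const pauseMs = Math.max(0, spanMs - voicedMs);
  return {
    charsPerMinute: voicedMs > 0 ? (chars / voicedMs) * 60000 : 0,
    stagnationRate: spanMs > 0 ? pauseMs / spanMs : 0,
  };
}
```

Because the result is plain numbers, the model never needs to see the raw timestamps; only the two computed values would be passed on for wording.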
Architecture: Keys Stay Server-Side
Both AmiVoice and Anthropic require API keys, so Next.js API Routes act as a thin BFF (Backend for Frontend) relay. The browser never calls the external APIs directly: it records audio via MediaRecorder and POSTs to /api/recognize or /api/feedback, and those routes hold the secrets server-side. This mirrors how a Spring @RestController reads external keys from application.yml without exposing them to clients.
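The relay idea can be sketched as a server-only helper that attaches the key to the upstream request. This is an assumption-laden illustration rather than the project's code: buildAnthropicRequest, FeedbackInput, the system prompt text, the model alias, and max_tokens are all hypothetical choices, and the route shown in the trailing comment assumes an App Router-style handler.

```typescript
// Illustrative names only; none of these come from the original project.
interface FeedbackInput {
  charsPerMinute: number;
  stagnationRate: number; // 0..1
}

interface UpstreamRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Runs on the server only: the key is placed in a header of the outgoing
// request and never appears in anything returned to the browser.
function buildAnthropicRequest(apiKey: string, input: FeedbackInput): UpstreamRequest {
  if (!apiKey) {
    throw new Error("API key is not configured on the server");
  }
  const speed = Math.round(input.charsPerMinute);
  const pausePct = (input.stagnationRate * 100).toFixed(1);
  return {
    url: "https://api.anthropic.com/v1/messages",
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-haiku-4-5", // assumed model alias
      max_tokens: 300,           // arbitrary cap for a short coaching note
      system: "You are a read-aloud coach. Comment only on the numbers given.",
      messages: [
        { role: "user", content: `Speed: ${speed} chars/min; pause share: ${pausePct}%.` },
      ],
    }),
  };
}

// Inside a route such as app/api/feedback/route.ts (path assumed), roughly:
//   export async function POST(req: Request) {
//     const input = await req.json();
//     const { url, ...init } = buildAnthropicRequest(process.env.ANTHROPIC_API_KEY ?? "", input);
//     const upstream = await fetch(url, init);
//     return Response.json(await upstream.json(), { status: upstream.status });
//   }
```

Keeping request construction in a pure function makes it easy to unit-test that the key only ever travels in the upstream header.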
The "Optimization" That Wasn't
Here's where it gets interesting: the developer added cache_control to their static system prompt for Haiku, did the break-even math, and concluded "it pays off after two uses." Except it didn't. Claude Haiku 4.5's minimum cacheable size is 4,096 tokens, and their system prompt was a few hundred, so it fell below the threshold and the cache_control marker had no effect. Nothing was cached, as the usage metrics confirmed: both cache_creation_input_tokens and cache_read_input_tokens stayed at zero.
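Both halves of that story, the break-even arithmetic and the check that exposed it, fit in a few lines. The sketch below is illustrative: the usage field names are the ones cited above, but the helper names are hypothetical, and the cost multipliers (1.25x for a cache write, 0.1x for a cache read, relative to normal input pricing) are assumptions that should be checked against current pricing.

```typescript
// Subset of the usage object returned with a response; field names as cited above.
interface Usage {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

type CacheStatus = "read" | "written" | "not_cached";

// Classify what actually happened on one call, based on reported usage.
function cacheStatus(usage: Usage): CacheStatus {
  if ((usage.cache_read_input_tokens ?? 0) > 0) return "read";
  if ((usage.cache_creation_input_tokens ?? 0) > 0) return "written";
  return "not_cached";
}

// A prefix below the model's minimum is not cached, whatever the prompt marks.
// The 4,096 default is the Haiku 4.5 figure quoted in the text.
function meetsCacheMinimum(prefixTokens: number, minTokens = 4096): boolean {
  return prefixTokens >= minTokens;
}

// Input cost of `calls` requests sharing one prefix, in units of
// "one uncached pass over the prefix". Multipliers are assumed values.
function relativeCost(calls: number, writeMultiplier = 1.25, readMultiplier = 0.1) {
  const uncached = calls;
  const cached = calls > 0 ? writeMultiplier + (calls - 1) * readMultiplier : 0;
  return { uncached, cached };
}
```

Under these assumed multipliers, one call costs more with caching (1.25 versus 1) and two calls cost less (about 1.35 versus 2), which matches the "pays off after two uses" estimate. That estimate only applies when meetsCacheMinimum holds; for a prompt of a few hundred tokens, cacheStatus reports "not_cached" on every call.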
When Written-in-Prompt ≠ Obeyed
Another gotcha: the recognized text from a read-aloud contains no punctuation, yet Haiku started tacking on tips about "taking a breath at punctuation," something that was not in the input. The developer added "don't mention punctuation" to the prompt, but even that instruction did not guarantee compliance. This mirrors lessons from other projects: passing tests doesn't mean behaving as intended.
Key Takeaways
- AmiVoice timestamps enable metrics beyond simple transcription—but only if you actually use them
- BFF architecture keeps API keys out of browsers without much complexity overhead
- Small models like Haiku stay effective when kept to what they're best at (wording, tone)
- Prompt caching has real minimum thresholds—do the math before claiming optimization
- Verify AI output against primary sources; even domain explanations can be backwards