Pi Coding Agent Gets Hold-to-Talk Voice Input With Dual Cloud-Local Backends

Voice input for coding assistants just got a serious upgrade. The newly released pi-listen extension (v7.0.0) adds hold-to-talk voice transcription to Pi agents, with a choice between real-time cloud streaming via Deepgram and fully offline batch processing using local ONNX models. The setup takes about two minutes and works across macOS, Windows, and Linux.

How Hold-to-Talk Voice Input Works

The interaction model is refreshingly simple: hold SPACE for at least 1.2 seconds to activate recording, speak your code or prompts naturally, then release to finalize. The system captures audio during the warmup countdown so you never miss that first word. After release, it keeps recording for an additional 1.5 seconds (tail recording) so your final syllable isn't clipped. With Deepgram enabled, transcripts appear in real time as you speak via WebSocket streaming.
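The timing rules above can be sketched as a small state model. This is an illustration only, not pi-listen's actual code: the class name HoldToTalk and its methods are hypothetical, and only the 1.2-second threshold and 1.5-second tail come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

HOLD_THRESHOLD_S = 1.2  # minimum hold before a press counts as a recording
TAIL_RECORDING_S = 1.5  # extra capture after release


@dataclass
class HoldToTalk:
    """Hypothetical timing model; timestamps are seconds on any monotonic clock."""
    pressed_at: Optional[float] = None
    released_at: Optional[float] = None

    def press(self, now: float) -> None:
        # Buffering starts at the press itself, so warmup audio is kept.
        self.pressed_at = now
        self.released_at = None

    def release(self, now: float) -> bool:
        """Return True if the hold was long enough to count as a recording."""
        if self.pressed_at is None:
            return False
        activated = (now - self.pressed_at) >= HOLD_THRESHOLD_S
        if activated:
            self.released_at = now
        else:
            self.pressed_at = None  # short tap: drop the buffered audio
        return activated

    def capture_window(self) -> Optional[Tuple[float, float]]:
        """Start and end of the audio to keep, including warmup and tail."""
        if self.pressed_at is None or self.released_at is None:
            return None
        return (self.pressed_at, self.released_at + TAIL_RECORDING_S)
```

In this sketch a half-second tap is discarded, while a three-second hold yields a window that starts at the key press and ends 1.5 seconds after release.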

Cloud vs Local: The Tradeoffs

pi-listen ships with two distinct backends serving different use cases. Deepgram Nova 3 delivers live streaming with interim results appearing as you talk—56+ languages supported—but requires an internet connection and sends audio to the cloud. New users get $200 in free credit, which typically lasts six to twelve months for individual developers. Local models, by contrast, never touch the network after initial setup; they process audio entirely on-device using sherpa-onnx inference with a 2–10 second turnaround after you finish speaking.
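One way to picture the tradeoff is a selection rule that only uses the cloud backend when it is enabled, reachable, and has a key. The sketch below is an assumption about how such a rule could look, not the extension's documented behavior: the function choose_backend and the environment variable name DEEPGRAM_API_KEY are hypothetical.

```python
from typing import Mapping


def choose_backend(env: Mapping[str, str], cloud_enabled: bool, online: bool) -> str:
    """Return "deepgram" for live streaming or "local" for offline batch.

    DEEPGRAM_API_KEY is an assumed variable name used for illustration.
    """
    has_key = bool(env.get("DEEPGRAM_API_KEY", "").strip())
    if cloud_enabled and online and has_key:
        return "deepgram"
    # Anything else stays on-device: no network use, slower turnaround.
    return "local"
```

Passing os.environ as env would apply the rule to the current shell; passing a plain dict keeps it easy to test.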

Model Selection and Performance Ratings

Nineteen local models across five families populate the settings panel, each rated for accuracy (●●●●○ scale) and speed. Parakeet TDT v3 leads overall at 671 MB with 6.3% WER on 25 languages via auto-detection. For English-only workloads, Parakeet TDT v2 squeezes out better performance at 6.0% WER despite being slightly smaller. Whisper variants cover the broadest language support (57 total) but trade speed for accuracy—Large v3 hits the highest Whisper accuracy at 1.8 GB but crawls on CPU-only systems.
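The Parakeet comparison comes down to picking the lowest word error rate among models that cover the languages you need. A minimal sketch using only the two WER figures quoted above follows; the Model type and pick function are hypothetical, and TDT v2 is treated as covering a single language because it is recommended for English-only workloads.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Model:
    name: str
    wer_pct: float   # word error rate in percent, lower is better
    languages: int   # number of supported languages


PARAKEET: List[Model] = [
    Model("Parakeet TDT v3", 6.3, 25),
    Model("Parakeet TDT v2", 6.0, 1),  # assumed English-only
]


def pick(models: List[Model], needed_languages: int) -> Model:
    """Lowest-WER model that supports at least needed_languages languages."""
    eligible = [m for m in models if m.languages >= needed_languages]
    if not eligible:
        raise ValueError("no model covers that many languages")
    return min(eligible, key=lambda m: m.wer_pct)
```

With one language required, the sketch picks TDT v2; as soon as a second language is needed, only TDT v3 qualifies.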

The Raspberry Pi Angle

Moonshine models target constrained hardware specifically. Moonshine v2 Tiny weighs just 43 MB with 34 ms latency and carries a "Raspberry Pi friendly" badge, meaning you can run voice input without GPU acceleration on embedded-class hardware. This opens interesting possibilities for local-first development workflows or edge deployment scenarios where cloud connectivity isn't guaranteed.

Privacy and Security Architecture

For the security-conscious, pi-listen offers a clean path: disable Deepgram, pick a local model, and your audio never leaves the machine. The documentation explicitly states no telemetry collection, with API keys stored in environment variables rather than settings files where they could leak into version control. The download pipeline includes pre-checks for disk space and permissions, resumable transfers with post-completion verification, and deduplication to prevent accidental double-downloads.
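The three download safeguards map onto familiar building blocks. The sketch below shows one plausible shape for each, under stated assumptions: the function names are hypothetical, resumption is assumed to use an HTTP Range request, and verification is assumed to be a SHA-256 comparison, none of which the documentation summary confirms.

```python
import hashlib
import os
import shutil
from typing import Dict


def has_room(dest_dir: str, needed_bytes: int) -> bool:
    """Pre-check: is there enough free disk space under dest_dir?"""
    return shutil.disk_usage(dest_dir).free >= needed_bytes


def resume_headers(partial_path: str) -> Dict[str, str]:
    """HTTP Range header that continues from the bytes already on disk."""
    done = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    return {"Range": f"bytes={done}-"} if done else {}


def verify(path: str, expected_sha256: str) -> bool:
    """Post-completion check: compare the file's SHA-256 to an expected digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large model files never sit fully in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

A downloader built this way would call has_room before starting, send resume_headers with each request, and only move the file into place once verify returns True.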

Setup and Audio Tool Detection

The extension auto-detects your audio capture stack without manual configuration. Priority goes to SoX (rec) on all platforms, falling back to ffmpeg if needed, then arecord for Linux-only ALSA setups. Most developers already have one of these tools installed—running "brew install sox" or the equivalent covers the edge cases.
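The priority order described above can be expressed as a short lookup on the PATH. This is a sketch of the idea rather than the extension's implementation; detect_recorder is a hypothetical name, and the lookup is injectable so the order can be checked without the tools installed.

```python
import shutil
import sys
from typing import Callable, Optional


def detect_recorder(
    platform: str = sys.platform,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first available capture tool, or None if nothing is found."""
    candidates = ["rec", "ffmpeg"]      # SoX first, then ffmpeg, on every platform
    if platform.startswith("linux"):
        candidates.append("arecord")    # ALSA fallback, Linux only
    for tool in candidates:
        if which(tool):
            return tool
    return None
```

Calling detect_recorder() with no arguments checks the current machine; a None result is the cue to install one of the tools.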

Key Takeaways

Hold SPACE (≥1.2s) to record, release to transcribe—no awkward keyword triggers
Deepgram delivers live streaming; local models offer complete offline privacy
Device-aware recommendations score models against your RAM/CPU/GPU profile
19 model options spanning Parakeet, Whisper, Moonshine, SenseVoice, and GigaAM families
MIT licensed by @baanditeagle with no telemetry or usage tracking

The Bottom Line

pi-listen is the kind of tool that makes you wonder why more coding agents don't ship with native voice input. The dual-backend approach respects both cloud-first developers who want real-time feedback and security-paranoid shops running air-gapped environments. If you're spending hours daily inside a terminal, holding SPACE to dictate code beats typing repetitive boilerplate every time—your wrists will thank you.
