Voice input for coding assistants just got a serious upgrade. The newly released pi-listen extension (v7.0.0) adds hold-to-talk voice transcription to Pi agents with a choice between real-time cloud streaming via Deepgram or fully offline batch processing using local ONNX models. The setup takes about two minutes and works across macOS, Windows, and Linux.
How Hold-to-Talk Voice Input Works
The interaction model is refreshingly simple: hold SPACE for at least 1.2 seconds to activate recording, speak your code or prompts naturally, then release to finalize. Audio is already being captured during the warmup countdown, so the first word is not lost. After release, recording continues for another 1.5 seconds (tail recording) so your final syllable isn't clipped. With Deepgram enabled, transcripts stream over a WebSocket and appear in real time as you speak.
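The timing rules above can be sketched as a small state holder. This is only an illustration of the 1.2-second hold threshold and the 1.5-second tail described in the article; the class name, method names, and the idea of returning a capture window are assumptions, not pi-listen's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

HOLD_THRESHOLD_S = 1.2  # from the article: minimum hold before recording activates
TAIL_S = 1.5            # from the article: extra capture time after release


@dataclass
class HoldToTalk:
    """Hypothetical timing model for hold-to-talk (illustrative only)."""

    pressed_at: Optional[float] = None

    def press(self, now: float) -> None:
        # Buffering starts at key-down so audio spoken during warmup is kept.
        self.pressed_at = now

    def release(self, now: float) -> Optional[Tuple[float, float]]:
        """Return the (start, end) capture window, or None for a short tap."""
        if self.pressed_at is None:
            return None
        start, self.pressed_at = self.pressed_at, None
        if now - start < HOLD_THRESHOLD_S:
            return None  # too short: treat as an ordinary key press
        # Keep everything from key-down through the tail after release.
        return (start, now + TAIL_S)
```

Under these assumptions, a 0.5-second tap yields no recording, while a 2-second hold starting at t=20.0 yields a window from 20.0 to 23.5.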
Cloud vs Local: The Tradeoffs
pi-listen ships with two distinct backends serving different use cases. Deepgram Nova 3 streams live, with interim results appearing as you talk, and supports 56+ languages, but it requires an internet connection and sends audio to the cloud. New users get $200 in free credit, which typically lasts six to twelve months for individual developers. Local models, by contrast, never touch the network after initial setup: they process audio entirely on-device using sherpa-onnx inference, with a 2–10 second turnaround after you finish speaking.
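One way to picture the choice between the two backends is a small selection function. This is a sketch under stated assumptions: the environment variable name `DEEPGRAM_API_KEY`, the `deepgram_enabled` flag, and the returned labels are hypothetical and may not match pi-listen's real settings.

```python
import os
from typing import Mapping


def choose_backend(
    env: Mapping[str, str] = os.environ,
    deepgram_enabled: bool = True,
) -> str:
    """Pick "deepgram" only when it is enabled and a key is present.

    The variable name DEEPGRAM_API_KEY is an assumption for illustration.
    """
    key = env.get("DEEPGRAM_API_KEY", "").strip()
    if deepgram_enabled and key:
        return "deepgram"  # live streaming, audio leaves the machine
    return "local"         # offline batch transcription on-device
```

Falling back to the local backend whenever the key is missing keeps the offline path as the safe default in this sketch.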
Model Selection and Performance Ratings
Nineteen local models across five families populate the settings panel, each rated for accuracy (●●●●○ scale) and speed. Parakeet TDT v3 leads overall: 671 MB, 6.3% WER, and 25 languages with automatic language detection. For English-only workloads, Parakeet TDT v2 does slightly better at 6.0% WER despite being slightly smaller. Whisper variants offer the broadest language coverage (57 languages) but trade speed for accuracy: Large v3 reaches the highest Whisper accuracy at 1.8 GB but crawls on CPU-only systems.
The Raspberry Pi Angle
Moonshine models target constrained hardware specifically. Moonshine v2 Tiny weighs just 43 MB with 34 ms latency and carries a "Raspberry Pi friendly" badge, meaning you can run voice input without GPU acceleration on embedded-class hardware. This opens interesting possibilities for local-first development workflows or edge deployment scenarios where cloud connectivity isn't guaranteed.
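To illustrate how a device-aware recommendation could fall back to a small model on constrained hardware, here is a minimal sketch. Only the model names and sizes come from the article; the preference order, the `RAM_FACTOR` headroom multiplier, and the function itself are hypothetical stand-ins, not the scoring pi-listen actually uses.

```python
from typing import List, Optional, Tuple

# (name, size in MB) in a hand-picked preference order.
# Sizes are from the article; the ordering is an assumption.
MODELS: List[Tuple[str, int]] = [
    ("Parakeet TDT v3", 671),
    ("Moonshine v2 Tiny", 43),
]

RAM_FACTOR = 2  # hypothetical: require free RAM of at least 2x the model size


def recommend(free_ram_mb: int) -> Optional[str]:
    """Return the first preferred model that fits the RAM budget, if any."""
    for name, size_mb in MODELS:
        if size_mb * RAM_FACTOR <= free_ram_mb:
            return name
    return None
```

With these assumed numbers, a machine with 4 GB free gets Parakeet TDT v3, while one with 512 MB free falls back to Moonshine v2 Tiny.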
Privacy and Security Architecture
For the security-conscious, pi-listen offers a clean path: disable Deepgram, pick a local model, and your audio never leaves the machine. The documentation explicitly states that no telemetry is collected, and API keys are stored in environment variables rather than in settings files, where they could leak into version control. The download pipeline includes pre-checks for disk space and permissions, resumable transfers with post-completion verification, and deduplication to prevent accidental double-downloads.
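The download safeguards can be sketched with the Python standard library. The article does not say how verification is performed, so the SHA-256 check, the 10% headroom, and all function names below are assumptions chosen for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def has_room(target_dir: Path, needed_bytes: int, headroom: float = 1.1) -> bool:
    """Pre-check: is there enough free disk space (plus assumed headroom)?"""
    return shutil.disk_usage(target_dir).free >= needed_bytes * headroom


def resume_offset(partial: Path) -> int:
    """Byte offset to resume from, e.g. for an HTTP Range request."""
    return partial.stat().st_size if partial.exists() else 0


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Post-download check: hash the file in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A caller would compare `sha256_of(path)` against a published digest, if one is available, before marking the model as installed.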
Setup and Audio Tool Detection
The extension auto-detects your audio capture stack without manual configuration. Priority goes to SoX (rec) on all platforms, falling back to ffmpeg if needed, then arecord for Linux-only ALSA setups. Most developers already have one of these tools installed; running "brew install sox" or the equivalent covers the edge cases.
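The detection order described above can be approximated by probing the PATH. This is a sketch, not the extension's code: the function name and the injectable `which` and `platform` parameters are assumptions added so the logic is easy to exercise.

```python
import shutil
import sys
from typing import Callable, Optional


def detect_recorder(
    which: Callable[[str], Optional[str]] = shutil.which,
    platform: str = sys.platform,
) -> Optional[str]:
    """Return the first available capture tool in the article's priority order."""
    candidates = ["rec", "ffmpeg"]      # SoX first, then ffmpeg, on every platform
    if platform.startswith("linux"):
        candidates.append("arecord")    # ALSA fallback, Linux only
    for tool in candidates:
        if which(tool):
            return tool
    return None
```

Calling `detect_recorder()` with no arguments checks the current machine; a result of `None` means none of the three tools was found on the PATH.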
Key Takeaways
- Hold SPACE (≥1.2s) to record, release to transcribe—no awkward keyword triggers
- Deepgram delivers live streaming; local models offer complete offline privacy
- Device-aware recommendations score models against your RAM/CPU/GPU profile
- 19 model options spanning Parakeet, Whisper, Moonshine, SenseVoice, and GigaAM families
- MIT licensed by @baanditeagle with no telemetry or usage tracking
The Bottom Line
pi-listen is the kind of tool that makes you wonder why more coding agents don't ship with native voice input. The dual-backend approach respects both cloud-first developers who want real-time feedback and security-paranoid shops running air-gapped environments. If you're spending hours daily inside a terminal, holding SPACE to dictate code beats typing repetitive boilerplate every time—your wrists will thank you.