When I started working full-time with Claude Code, I found myself wanting to speak to it quite frequently. But I couldn’t find the right tooling for a simple workflow:
- Press a shortcut to start recording
- Talk
- Press another shortcut to stop recording
- Press enter to send
The first prototype came together in a day with Node.js. The reliability and speed of the transcription were amazing. This pushed me to turn it into something real, and here I’m introducing a Rust-based local CLI for speech-to-text.
Thanks to NVIDIA’s Parakeet model, Para-speak works amazingly well for AI-assisted coding, and I’m open-sourcing the CLI tool!
OpenAI’s Whisper app delivers good accuracy, but it feels slow and requires manual steps to get text where you need it.
Other desktop applications I tried came with cluttered UIs and didn’t provide the flexibility I was looking for - some wouldn’t even let me try them without jumping through hoops.
While trying other products, I discovered CtrlSpeak, an open-source project that was implementing something close to what I was looking for. This inspired me to try building my own solution with NVIDIA’s Parakeet model.
Plug and Play
Para-speak
In Ukrainian, “Pora” means “It’s time.”
Since the vowel sounds “o” and “a” are so close, I often pronounce it as “para”.
The name of the project is meant to capture this idea - it’s time to speak.
For now, running the program requires a one-time setup to initialize the Python environment and download the Parakeet model.
# Set up environment and download model (first time only)
cargo run -p verify-cli
All behavior is configurable through environment variables.
By default, use the following shortcuts:
- Start recording: ControlLeft + ControlLeft (double tap)
- Stop recording: ControlLeft
- Cancel recording: Escape + Escape (double tap)
- Pause/resume: No default shortcut
Make sure the double Control shortcut does not conflict with the macOS dictation shortcut at Keyboard > Dictation > Shortcut.
Running the CLI
# Note: On first run, macOS will prompt for Accessibility permissions (for shortcuts)
# and Microphone access (for recording)
./para-speak
# Run in debug mode
./para-speak -d
The keypress events still pass through to your system, so choose shortcuts that won’t conflict with your other applications. rdev grab is not working reliably, and the CLI sometimes needs to be restarted to work properly. listen works more predictably, but some shortcuts might insert characters alongside triggering the shortcut.
Architecture
Para-speak is built in Rust, handling the majority of functionality—audio capture, keyboard shortcuts, system integration, and the CLI interface.
Python is used specifically for ML inference with the Parakeet MLX model through PyO3 bindings.
The Rust implementation focuses on speed and efficiency. Every part of the audio pipeline and system interaction is optimized for minimal latency. Feedback on the Rust code is very welcome, as it’s one of my first complete Rust projects.
When idle, Para-speak uses minimal resources—around 10MB of RAM on a MacBook M1 Pro.
Cross-platform support
Shortcut System & Extensibility
The shortcut system offers different ways to trigger actions:
- Single keys: F1, Escape, ControlLeft
- Combinations: CmdLeft+Shift+Y, Ctrl+Alt+A
- Double-taps: double(ControlLeft, 300) (300 is the delay between taps in ms)
Several of these, separated by ;, can be assigned to any shortcut - start, stop, pause, or cancel.
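To make the shortcut syntax concrete, here is a minimal sketch of how such a string could be parsed. It is not Para-speak's actual parser: the `Shortcut` enum, `parse_one`, and `parse_shortcuts` are hypothetical names, and key names are kept as plain strings rather than validated against a real key table.

```rust
// Hypothetical sketch: NOT Para-speak's real parser, just the syntax described above.

#[derive(Debug, PartialEq)]
enum Shortcut {
    /// A single key such as `F1`.
    Key(String),
    /// Keys pressed together, such as `CommandLeft+ShiftLeft+KeyY`.
    Combo(Vec<String>),
    /// The same key tapped twice within `max_gap_ms` milliseconds.
    DoubleTap { key: String, max_gap_ms: u64 },
}

/// Parse one alternative, e.g. `double(ControlLeft, 300)` or `Ctrl+Alt+A`.
fn parse_one(spec: &str) -> Result<Shortcut, String> {
    let spec = spec.trim();
    if spec.is_empty() {
        return Err("empty shortcut".to_string());
    }
    // `double(Key, ms)` form.
    if let Some(inner) = spec.strip_prefix("double(").and_then(|s| s.strip_suffix(')')) {
        let (key, delay) = inner
            .split_once(',')
            .ok_or_else(|| format!("expected double(Key, ms), got `{spec}`"))?;
        let max_gap_ms = delay
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("invalid delay in `{spec}`: {e}"))?;
        let key = key.trim();
        if key.is_empty() {
            return Err(format!("missing key name in `{spec}`"));
        }
        return Ok(Shortcut::DoubleTap { key: key.to_string(), max_gap_ms });
    }
    // Single key or `A+B+C` combination.
    let mut keys: Vec<String> = spec.split('+').map(|k| k.trim().to_string()).collect();
    if keys.iter().any(|k| k.is_empty()) {
        return Err(format!("empty key name in `{spec}`"));
    }
    if keys.len() == 1 {
        Ok(Shortcut::Key(keys.remove(0)))
    } else {
        Ok(Shortcut::Combo(keys))
    }
}

/// Parse a full value: alternatives separated by `;`.
fn parse_shortcuts(value: &str) -> Result<Vec<Shortcut>, String> {
    value.split(';').map(parse_one).collect()
}

fn main() {
    let parsed = parse_shortcuts("double(ControlLeft, 300); CommandLeft+ShiftLeft+KeyY");
    println!("{:?}", parsed);
}
```

Keeping each alternative as its own value means any one of them can trigger the action, which matches how several alternatives are listed for a single shortcut.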
The system is optimized to minimize resource usage: when idle, it only listens for the start recording shortcut. Once recording begins, other shortcuts become active. For sequences and combinations, Para-speak only listens for the first key, activating full detection only when needed.
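The state-dependent listening can be pictured as a small state machine. The sketch below is only an illustration under assumed names (`State`, `Action`, `active_actions`, `apply`), not the project's real code. In particular, treating pause/resume as one toggling action and keeping stop and cancel available while paused are my assumptions.

```rust
// Hypothetical sketch of state-dependent shortcut activation.

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Idle,
    Recording,
    Paused,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Start,
    Stop,
    Pause, // assumed to toggle between pause and resume
    Cancel,
}

/// Which shortcuts are worth listening for in a given state.
fn active_actions(state: State) -> &'static [Action] {
    match state {
        // When idle, only the start shortcut is watched.
        State::Idle => &[Action::Start],
        // Once recording has begun, the remaining shortcuts become active.
        State::Recording | State::Paused => &[Action::Stop, Action::Pause, Action::Cancel],
    }
}

/// Apply an action, ignoring it if it is not active in the current state.
fn apply(state: State, action: Action) -> State {
    if !active_actions(state).contains(&action) {
        return state;
    }
    match (state, action) {
        (State::Idle, Action::Start) => State::Recording,
        (State::Recording, Action::Pause) => State::Paused,
        (State::Paused, Action::Pause) => State::Recording,
        (State::Recording | State::Paused, Action::Stop | Action::Cancel) => State::Idle,
        _ => state,
    }
}

fn main() {
    let mut state = State::Idle;
    for action in [Action::Stop, Action::Start, Action::Pause, Action::Pause, Action::Stop] {
        state = apply(state, action);
        println!("{:?} -> {:?}", action, state);
    }
}
```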
Para-speak uses a controller system that makes it easy to extend functionality. Controllers can be enabled through environment variables and get notified of recording events to execute custom actions.
The Spotify controller is one example - it adjusts music volume during recording. The same pattern can be used to build any type of asynchronous integration, or to trigger any automation one might need after the recording is transcribed.
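As a rough illustration of the controller idea, the sketch below defines a trait that receives recording events, plus a stand-in volume controller. Everything here is an assumption for illustration: the `Controller` trait, the `RecordingEvent` variants, and `VolumeController` are hypothetical names, the controller only tracks the volume it would set instead of talking to Spotify, and the calls are synchronous for brevity even though a real integration would likely be asynchronous.

```rust
// Hypothetical sketch of a controller that reacts to recording events.

#[derive(Debug, Clone, Copy, PartialEq)]
enum RecordingEvent {
    Started,
    Paused,
    Resumed,
    Stopped,
    Cancelled,
}

/// Anything that wants to be notified about recording events.
trait Controller {
    fn name(&self) -> &str;
    fn on_event(&mut self, event: RecordingEvent);
}

/// Stand-in for a music controller: it only records the volume it *would* set.
struct VolumeController {
    normal_volume: u8,
    recording_volume: u8,
    current_volume: u8,
}

impl VolumeController {
    fn new(normal_volume: u8, recording_volume: u8) -> Self {
        Self { normal_volume, recording_volume, current_volume: normal_volume }
    }
}

impl Controller for VolumeController {
    fn name(&self) -> &str {
        "volume"
    }

    fn on_event(&mut self, event: RecordingEvent) {
        self.current_volume = match event {
            // Lower the volume while audio is being captured.
            RecordingEvent::Started | RecordingEvent::Resumed => self.recording_volume,
            // Restore it otherwise (restoring on pause is an assumption).
            RecordingEvent::Paused | RecordingEvent::Stopped | RecordingEvent::Cancelled => {
                self.normal_volume
            }
        };
    }
}

/// Forward one event to every enabled controller.
fn notify_all(controllers: &mut [Box<dyn Controller>], event: RecordingEvent) {
    for controller in controllers.iter_mut() {
        controller.on_event(event);
    }
}

fn main() {
    let mut controllers: Vec<Box<dyn Controller>> =
        vec![Box::new(VolumeController::new(80, 30))];
    for event in [RecordingEvent::Started, RecordingEvent::Stopped] {
        notify_all(&mut controllers, event);
        for controller in &controllers {
            println!("{} handled {:?}", controller.name(), event);
        }
    }
}
```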
Configuration
Para-speak uses environment variables for all configuration. Create a .env.local file in the root of the project directory:
# Keyboard shortcuts
PARA_START_KEYS="double(ControlLeft, 300); CommandLeft+ShiftLeft+KeyY"
PARA_STOP_KEYS="ControlLeft; CommandLeft+ShiftLeft+KeyY"
PARA_CANCEL_KEYS="double(Escape, 300)"
PARA_PAUSE_KEYS="CommandLeft+Alt+Shift+KeyU"
# Core functionality
PARA_PASTE=true # Auto-paste transcribed text at cursor
# Spotify integration
PARA_SPOTIFY_RECORDING_VOLUME=30 # Set Spotify to specific volume (0-100)
PARA_SPOTIFY_REDUCE_BY=50 # OR reduce volume by amount (0-100)
# Transcription behavior
PARA_TRANSCRIBE_ON_PAUSE=true # Experimental: transcribe when pausing (not just on stop)
# Advanced
PARA_SHORTCUT_RESOLUTION_DELAY_MS=50 # Delay for resolving shortcut conflicts
PARA_MEMORY_MONITOR=true # Enable memory usage reporting
# Debugging
PARA_DEBUG=true # Enable debug mode with verbose output
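For readers curious how variables like these might be consumed on the Rust side, here is a minimal sketch using only the standard library. It assumes the variables are already present in the process environment (loading the .env.local file itself is a separate step not shown). The `Config` struct, `parse_bool`, `load_config`, and every fallback value except the double-ControlLeft start shortcut are hypothetical, not Para-speak's documented defaults.

```rust
use std::env;

// Hypothetical sketch: field names and fallback values are assumptions.
#[derive(Debug, PartialEq)]
struct Config {
    start_keys: String,
    paste: bool,
    shortcut_resolution_delay_ms: u64,
    debug: bool,
}

/// Accept a few common spellings of "true"; everything else is false.
fn parse_bool(raw: &str) -> bool {
    matches!(raw.trim().to_ascii_lowercase().as_str(), "1" | "true" | "yes" | "on")
}

/// Build a config from any lookup function, so it can be tested without
/// touching the real process environment.
fn load_config(get: impl Fn(&str) -> Option<String>) -> Config {
    Config {
        start_keys: get("PARA_START_KEYS")
            .unwrap_or_else(|| "double(ControlLeft, 300)".to_string()),
        paste: get("PARA_PASTE").map_or(false, |v| parse_bool(&v)),
        // Fall back to an assumed 50 ms when the value is unset or unparsable.
        shortcut_resolution_delay_ms: get("PARA_SHORTCUT_RESOLUTION_DELAY_MS")
            .and_then(|v| v.trim().parse::<u64>().ok())
            .unwrap_or(50),
        debug: get("PARA_DEBUG").map_or(false, |v| parse_bool(&v)),
    }
}

fn main() {
    // In a real binary the lookup is simply the process environment.
    let config = load_config(|name| env::var(name).ok());
    println!("{:#?}", config);
}
```

Passing the lookup as a closure keeps the parsing logic independent of where the values come from, which makes it easy to exercise with fixed inputs.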
Check the README for detailed documentation.