# Local Whisper
Switch voice transcription to on-device whisper.cpp. No API key, no network calls, no per-message cost.
## What it does
- Runs entirely on-device using whisper.cpp — no data leaves your machine
- No API key required and no per-message cost
- Base model processes ~30 seconds of audio in under 1 second on Apple Silicon
- Supports multiple model sizes for different accuracy/speed trade-offs
- Auto-detects language — works with voice notes in any language
## What you'll need
- NanoClaw installed and running
- Voice transcription skill already applied (WhatsApp channel)
- macOS with Apple Silicon (M1+) recommended
- Homebrew installed
## Install

`/use-local-whisper`

## How it works
The `/use-local-whisper` skill replaces the OpenAI Whisper API with a local whisper.cpp binary for voice message transcription. Instead of sending audio to OpenAI's servers, voice notes are processed entirely on your machine. The agent still receives messages in the same `[Voice: <transcript>]` format; the change is invisible to both the agent and you.
The skill installs two dependencies via Homebrew: `whisper-cpp` (which provides the `whisper-cli` binary) and `ffmpeg` (for audio format conversion). It then downloads a GGML model file and modifies `src/transcription.ts` to call the local binary instead of the OpenAI API.
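The modified transcription path can be sketched roughly as follows. This is a hypothetical illustration, not the skill's actual code: the function names are invented, while the env-var names and defaults come from the Configuration section, and `-m`, `-f`, and `--no-timestamps` are whisper.cpp's standard CLI flags.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Env vars documented by this skill, with the same defaults.
const WHISPER_BIN = process.env.WHISPER_BIN ?? "whisper-cli";
const WHISPER_MODEL = process.env.WHISPER_MODEL ?? "data/models/ggml-base.bin";

// Build the whisper-cli argument list: -m selects the GGML model,
// -f the input WAV; --no-timestamps keeps the output plain text.
function buildWhisperArgs(wavPath: string, modelPath: string = WHISPER_MODEL): string[] {
  return ["-m", modelPath, "-f", wavPath, "--no-timestamps"];
}

// Transcribe a (ffmpeg-converted) WAV file by shelling out to the local binary.
async function transcribeLocally(wavPath: string): Promise<string> {
  const { stdout } = await run(WHISPER_BIN, buildWhisperArgs(wavPath));
  return stdout.trim();
}
```

Because the binary reads a file and writes plain text to stdout, swapping it in for an HTTP API call leaves the rest of the pipeline untouched.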
## Model sizes
The skill defaults to the base model, which is a good balance of speed and accuracy. You can choose a larger model if you need better transcription quality:
| Model | Size | Speed (M1, 30s audio) | Best for |
|---|---|---|---|
| Base | 148 MB | < 1 second | Most use cases |
| Small | 466 MB | ~2 seconds | Better accuracy for accented speech |
| Medium | 1.5 GB | ~5 seconds | Best accuracy, multilingual |
The model file lives in `data/models/` and can be swapped at any time by downloading a different one and updating the `WHISPER_MODEL` environment variable.
## Cost comparison
With the OpenAI Whisper API, transcription costs approximately $0.006 per minute of audio. That’s negligible for light use, but if you process many voice notes daily, it adds up. With local whisper, the cost is zero after the one-time model download. The trade-off is that your machine does the work instead of OpenAI’s servers.
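A quick back-of-envelope calculation makes the break-even point concrete. The $0.006/minute rate is the figure quoted above; the usage numbers in the comment are purely illustrative.

```typescript
// API rate quoted above: ~$0.006 per minute of audio transcribed.
const RATE_PER_MINUTE = 0.006;

// Monthly API cost for a given daily volume of voice-note audio.
function monthlyApiCost(minutesPerDay: number, days: number = 30): number {
  return minutesPerDay * days * RATE_PER_MINUTE;
}

// Illustrative example: 20 one-minute voice notes per day
// costs 20 * 30 * $0.006 = $3.60/month via the API, $0 locally.
```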
## Configuration
Two optional environment variables can be set in `.env`:

- `WHISPER_BIN`: path to the `whisper-cli` binary. Defaults to `whisper-cli` (found via `PATH`).
- `WHISPER_MODEL`: path to the GGML model file. Defaults to `data/models/ggml-base.bin`.
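For example, a `.env` that points at a Homebrew-installed binary and the small model might look like this (both values are illustrative; omit either line to keep the default):

```shell
WHISPER_BIN=/opt/homebrew/bin/whisper-cli
WHISPER_MODEL=data/models/ggml-small.bin
```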
## Troubleshooting
**Transcription works in dev but not as a service.** The launchd service runs with a restricted `PATH` that may not include `/opt/homebrew/bin/`. The skill checks for this and fixes it during setup, but if you've reinstalled Homebrew or moved binaries, verify the `PATH` in `~/Library/LaunchAgents/com.nanoclaw.plist`.
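If you end up editing the plist by hand, the relevant section is launchd's standard `EnvironmentVariables` dictionary. A sketch (the `PATH` value is illustrative; keep whatever other entries your plist already has):

```xml
<key>EnvironmentVariables</key>
<dict>
    <key>PATH</key>
    <string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin</string>
</dict>
```

After editing, reload the service so launchd picks up the new environment.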
**Wrong language detected.** whisper.cpp auto-detects the language from the audio. To force a specific language, set `WHISPER_LANG` in `.env` and modify `src/transcription.ts` to pass `-l $WHISPER_LANG` to the binary.
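One possible shape for that change, as a sketch: a small helper that appends whisper.cpp's `-l` flag only when `WHISPER_LANG` is set (the helper name is hypothetical; the env-var name comes from the paragraph above).

```typescript
// Append a forced-language flag when WHISPER_LANG is set; otherwise
// leave the args untouched so whisper.cpp keeps auto-detecting.
function withLanguage(args: string[], lang: string | undefined = process.env.WHISPER_LANG): string[] {
  return lang ? [...args, "-l", lang] : args;
}
```

Keeping the flag optional means unsetting `WHISPER_LANG` restores auto-detection without another code change.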
**Slow transcription.** The base model should process 30 seconds of audio in under 1 second on M1 or newer. If it's slower, check CPU usage; another process may be competing for resources. Switching to a larger model will only make this worse; switching to a smaller one will speed transcription up.
## Tips
- This skill currently works with WhatsApp only. Other channels would need their own audio-download logic before local whisper can serve them.
- You can switch back to the OpenAI API at any time by reverting the changes to `src/transcription.ts` or re-applying the voice-transcription skill.
- Very short voice notes (under 1 second) sometimes produce empty or inaccurate transcriptions. This is a limitation of the Whisper model itself, not specific to the local version.