# Local Whisper
Switch voice transcription to on-device whisper.cpp. No API key, no network calls, no per-message cost.
## What it does
- Runs entirely on-device using whisper.cpp — no data leaves your machine
- No API key required and no per-message cost
- Base model processes ~30 seconds of audio in under 1 second on Apple Silicon
- Supports multiple model sizes for different accuracy/speed trade-offs
- Auto-detects language — works with voice notes in any language
## What you'll need
- NanoClaw installed and running
- Voice transcription skill already applied (WhatsApp channel)
- macOS with Apple Silicon (M1+) recommended
- Homebrew installed
## Install

`/use-local-whisper`

## How it works
The `/use-local-whisper` skill replaces the OpenAI Whisper API with a local whisper.cpp binary for voice message transcription. Instead of sending audio to OpenAI's servers, voice notes are processed entirely on your machine. The agent still receives messages in the same `[Voice: <transcript>]` format; the change is invisible to both the agent and you.
The skill installs two dependencies via Homebrew: `whisper-cpp` (which provides the `whisper-cli` binary) and `ffmpeg` (for audio format conversion). It then downloads a GGML model file and modifies `src/transcription.ts` to call the local binary instead of the OpenAI API.
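The modified transcription path can be sketched roughly as follows. This is a hypothetical illustration, not the skill's actual code: the function names are invented, while the env-var names and defaults come from the Configuration section, and `-m`, `-f`, and `--no-timestamps` are whisper.cpp's standard CLI flags.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Env vars documented by this skill, with the same defaults.
const WHISPER_BIN = process.env.WHISPER_BIN ?? "whisper-cli";
const WHISPER_MODEL = process.env.WHISPER_MODEL ?? "data/models/ggml-base.bin";

// Build the whisper-cli argument list: -m selects the GGML model,
// -f the input WAV; --no-timestamps keeps the output plain text.
function buildWhisperArgs(wavPath: string, modelPath: string = WHISPER_MODEL): string[] {
  return ["-m", modelPath, "-f", wavPath, "--no-timestamps"];
}

// Transcribe a (ffmpeg-converted) WAV file by shelling out to the local binary.
async function transcribeLocally(wavPath: string): Promise<string> {
  const { stdout } = await run(WHISPER_BIN, buildWhisperArgs(wavPath));
  return stdout.trim();
}
```

Because the binary reads a file and writes plain text to stdout, swapping it in for an HTTP API call leaves the rest of the pipeline untouched.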
## Model sizes
The skill defaults to the base model, which is a good balance of speed and accuracy. You can choose a larger model if you need better transcription quality:
| Model | Size | Speed (M1, 30s audio) | Best for |
|---|---|---|---|
| Base | 148 MB | < 1 second | Most use cases |
| Small | 466 MB | ~2 seconds | Better accuracy for accented speech |
| Medium | 1.5 GB | ~5 seconds | Best accuracy, multilingual |
The model file lives in `data/models/` and can be swapped at any time by downloading a different one and updating the `WHISPER_MODEL` environment variable.
## Cost comparison
With the OpenAI Whisper API, transcription costs approximately $0.006 per minute of audio. That’s negligible for light use, but if you process many voice notes daily, it adds up. With local whisper, the cost is zero after the one-time model download. The trade-off is that your machine does the work instead of OpenAI’s servers.
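A quick back-of-envelope calculation makes the break-even point concrete. The $0.006/minute rate is the figure quoted above; the usage numbers in the comment are purely illustrative.

```typescript
// API rate quoted above: ~$0.006 per minute of audio transcribed.
const RATE_PER_MINUTE = 0.006;

// Monthly API cost for a given daily volume of voice-note audio.
function monthlyApiCost(minutesPerDay: number, days: number = 30): number {
  return minutesPerDay * days * RATE_PER_MINUTE;
}

// Illustrative example: 20 one-minute voice notes per day
// costs 20 * 30 * $0.006 = $3.60/month via the API, $0 locally.
```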
## Configuration
Two optional environment variables can be set in `.env`:

- `WHISPER_BIN`: path to the `whisper-cli` binary. Defaults to `whisper-cli` (found via `PATH`).
- `WHISPER_MODEL`: path to the GGML model file. Defaults to `data/models/ggml-base.bin`.
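For example, a `.env` that points at a Homebrew-installed binary and the small model might look like this (both values are illustrative; omit either line to keep the default):

```shell
WHISPER_BIN=/opt/homebrew/bin/whisper-cli
WHISPER_MODEL=data/models/ggml-small.bin
```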
## Troubleshooting
**Transcription works in dev but not as a service.** The launchd service runs with a restricted `PATH` that may not include `/opt/homebrew/bin/`. The skill checks for this and fixes it during setup, but if you've reinstalled Homebrew or moved binaries, verify the `PATH` in `~/Library/LaunchAgents/com.nanoclaw.plist`.
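If you end up editing the plist by hand, the relevant section is launchd's standard `EnvironmentVariables` dictionary. A sketch (the `PATH` value is illustrative; keep whatever other entries your plist already has):

```xml
<key>EnvironmentVariables</key>
<dict>
    <key>PATH</key>
    <string>/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin</string>
</dict>
```

After editing, reload the service so launchd picks up the new environment.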
**Wrong language detected.** whisper.cpp auto-detects the language from the audio. To force a specific language, set `WHISPER_LANG` in `.env` and modify `src/transcription.ts` to pass `-l $WHISPER_LANG` to the binary.
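One possible shape for that change, as a sketch: a small helper that appends whisper.cpp's `-l` flag only when `WHISPER_LANG` is set (the helper name is hypothetical; the env-var name comes from the paragraph above).

```typescript
// Append a forced-language flag when WHISPER_LANG is set; otherwise
// leave the args untouched so whisper.cpp keeps auto-detecting.
function withLanguage(args: string[], lang: string | undefined = process.env.WHISPER_LANG): string[] {
  return lang ? [...args, "-l", lang] : args;
}
```

Keeping the flag optional means unsetting `WHISPER_LANG` restores auto-detection without another code change.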
**Slow transcription.** The base model should process 30 seconds of audio in under 1 second on M1 or newer. If it's slower, check CPU usage; another process may be competing for resources. Switching to a larger model will only make this worse; switching to a smaller one will speed transcription up.
## Tips
- This skill currently works with WhatsApp only. Other channels would need their own audio-download logic before local whisper can serve them.
- You can switch back to the OpenAI API at any time by reverting the changes to `src/transcription.ts` or re-applying the voice-transcription skill.
- Very short voice notes (under 1 second) sometimes produce empty or inaccurate transcriptions. This is a limitation of the Whisper model itself, not specific to the local version.