Voice Transcription
Transcribe voice messages automatically using OpenAI Whisper. Your agent reads voice notes as text.
What it does
- Automatic transcription of WhatsApp voice messages
- Uses OpenAI's Whisper API for high-accuracy speech-to-text
- Voice notes appear as text in the format [Voice: <transcript>]
- Low cost — approximately $0.006 per minute of audio
- Seamless integration with the WhatsApp channel
What you'll need
- NanoClaw installed and running
- WhatsApp channel configured
- OpenAI API key with funded account
Install
/add-voice-transcription
How it works
The /add-voice-transcription skill adds automatic voice message transcription to NanoClaw. When someone sends a voice note in WhatsApp, NanoClaw intercepts it, sends the audio to OpenAI’s Whisper API, and delivers the text transcription to the agent. The agent sees the message in the format [Voice: <transcript>] and can respond to it like any text message.
The skill applies two changes to your codebase. First, it creates a transcription.ts module that handles the Whisper API call — downloading the voice note, sending it to OpenAI, and returning the text. Second, it modifies the WhatsApp channel code to detect voice messages and route them through the transcription module before they reach the agent.
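A minimal sketch of what such a transcription module might look like, assuming Node 18+ globals (`fetch`, `FormData`, `Blob`). The endpoint URL and the `whisper-1` model name are OpenAI's real Whisper API; the function names, error handling, and the Ogg filename are illustrative assumptions, not NanoClaw's actual code:

```typescript
// Hypothetical sketch of a transcription.ts module (names are illustrative).
// Requires Node 18+ for the global fetch/FormData/Blob.

const WHISPER_URL = "https://api.openai.com/v1/audio/transcriptions";

// Send a downloaded voice note to the Whisper API and return the transcript.
async function transcribe(audio: Blob, apiKey: string): Promise<string> {
  const form = new FormData();
  form.append("file", audio, "voice-note.ogg"); // WhatsApp voice notes are Ogg/Opus
  form.append("model", "whisper-1");

  const res = await fetch(WHISPER_URL, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Whisper API error: ${res.status}`);
  const data = (await res.json()) as { text: string };
  return data.text;
}

// Wrap the transcript in the format the agent sees.
function formatVoiceMessage(transcript: string): string {
  return `[Voice: ${transcript}]`;
}
```

The `[Voice: <transcript>]` wrapper is what lets the agent distinguish transcribed audio from a typed message.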
During setup, you provide an OpenAI API key. The skill saves it to your environment and verifies that it works by making a test API call. It also confirms that your OpenAI account has funds available, since Whisper is a paid API.
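The verification step could be sketched roughly as follows. The models-list endpoint is OpenAI's real API and responds 401 to a bad key; the function name and the injectable `fetchFn` are assumptions for illustration, and the separate funds check is not shown:

```typescript
// Hypothetical sketch: verify an OpenAI API key with a cheap test call.
// fetchFn is injectable so the logic can be exercised without the network.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<{ ok: boolean; status: number }>;

async function verifyApiKey(apiKey: string, fetchFn: FetchLike = fetch): Promise<boolean> {
  const res = await fetchFn("https://api.openai.com/v1/models", {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  return res.ok; // a 401 here means the key is invalid
}
```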
Cost
Whisper transcription costs approximately $0.006 per minute of audio. A typical voice note of 15-30 seconds costs a fraction of a cent. Even heavy use (dozens of voice messages per day) adds up to only a few dollars per month. The cost is billed by OpenAI, not NanoClaw.
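The arithmetic behind those numbers, using the $0.006/minute rate (the helper function is just for illustration):

```typescript
// Estimate Whisper transcription cost at $0.006 per minute of audio.
const WHISPER_RATE_PER_MINUTE = 0.006; // USD, billed by OpenAI

function whisperCostUSD(audioSeconds: number): number {
  return (audioSeconds / 60) * WHISPER_RATE_PER_MINUTE;
}

// A 30-second voice note costs $0.003.
// Thirty such notes a day, every day for a month:
const monthlyUSD = whisperCostUSD(30) * 30 * 30; // ≈ $2.70
```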
Testing
Once the skill has been applied, you can test it immediately. Send a voice note in your registered WhatsApp chat and watch the logs. You’ll see entries showing the audio being downloaded, sent to Whisper, and the transcription being passed to the agent. The agent responds as if you’d typed the message.
If transcription fails for a specific message — due to background noise, very short audio, or an unsupported format — the logs show the failure reason. The agent receives a note that transcription failed rather than silently dropping the message.
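That fallback behavior can be sketched as below. Only the `[Voice: <transcript>]` success format comes from the docs; the wrapper name and the exact failure text are hypothetical:

```typescript
// Illustrative fallback: a failed transcription becomes an explicit note to
// the agent instead of a silently dropped message.
async function voiceToAgentText(
  audio: Blob,
  transcribe: (a: Blob) => Promise<string>,
): Promise<string> {
  try {
    return `[Voice: ${await transcribe(audio)}]`;
  } catch (err) {
    console.error("voice transcription failed:", err); // failure reason lands in the logs
    return "[Voice message received, but transcription failed]";
  }
}
```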
Local alternative
If you prefer not to send audio to OpenAI’s servers, NanoClaw also supports local transcription using whisper.cpp on Apple Silicon. The /use-local-whisper skill switches from the API to a local model that runs entirely on your machine. Transcription quality is comparable for English, and there’s no per-message cost. The trade-off is that the initial model download is around 1.5 GB, and transcription is slower on machines without dedicated ML hardware.
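For reference, a local setup invokes the whisper.cpp CLI rather than an HTTP API. A minimal sketch of building that invocation, where the binary location and model path are assumptions and `-m`/`-f`/`--no-timestamps` are whisper.cpp's standard flags:

```typescript
// Build a whisper.cpp command line for a downloaded voice note.
function buildLocalWhisperCmd(modelPath: string, audioPath: string): string[] {
  return [
    "./whisper.cpp/main",  // hypothetical location of the compiled binary
    "-m", modelPath,       // path to a downloaded ggml model file
    "-f", audioPath,       // the voice note, converted to 16 kHz WAV
    "--no-timestamps",     // plain text output, suitable for the agent
  ];
}
```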
Tips
- Voice transcription currently works with WhatsApp only. Support for voice messages on other channels depends on the platform’s API providing audio access.
- Whisper handles many languages well, not just English. If you receive voice notes in other languages, they’ll generally be transcribed accurately, though accuracy varies by language.
- Very short voice notes (under 1 second) sometimes produce empty or inaccurate transcriptions. This is a limitation of the Whisper model itself.
- The OpenAI API key is stored in your .env file and passed to the container as an environment variable. It is never sent anywhere other than OpenAI’s API endpoint.