Enhancement Official

Image Vision

Give the agent eyes. Images sent via WhatsApp are processed and passed to Claude as visual content.

What it does

Automatically processes images sent as WhatsApp attachments
Resizes images with sharp for optimal token usage
Passes images to Claude as base64-encoded multimodal content blocks
Saves processed images to the group workspace for reference
Agent sees and understands image content — no extra prompting needed

What you'll need

NanoClaw installed and running
WhatsApp channel configured

Install

/add-image-vision

How it works

The /add-image-vision skill lets NanoClaw agents see images. When someone sends a photo in a registered WhatsApp chat, NanoClaw downloads the image, resizes it using sharp (to keep token costs reasonable), saves it to the group workspace, and passes it to Claude as a multimodal content block. The agent sees the image alongside the text message and can describe, analyze, or respond to what’s in it.

The skill modifies two parts of the codebase. On the host side, it updates the WhatsApp channel to detect image attachments, download them, and process them through sharp for resizing. On the container side, it updates the agent runner to include processed images as visual content blocks in the message sent to Claude.
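The container-side step can be sketched as follows. This is a minimal illustration, not the skill’s actual code: the ProcessedImage type and the buildUserContent function are hypothetical names, and the block shapes follow the base64 image format documented for the Anthropic Messages API.

```typescript
// Hypothetical shape for an image the host has already resized and
// base64-encoded. The name and fields are assumptions for this sketch.
interface ProcessedImage {
  mediaType: "image/jpeg" | "image/png" | "image/webp";
  base64: string;
}

// Content block shapes as documented for the Anthropic Messages API.
type TextBlock = { type: "text"; text: string };
type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: ProcessedImage["mediaType"]; data: string };
};
type ContentBlock = TextBlock | ImageBlock;

// Build the content array for one user message: one image block per
// attachment, followed by the message text when there is any.
function buildUserContent(text: string, images: ProcessedImage[]): ContentBlock[] {
  const blocks: ContentBlock[] = images.map((img): ImageBlock => ({
    type: "image",
    source: { type: "base64", media_type: img.mediaType, data: img.base64 },
  }));
  if (text.trim().length > 0) {
    blocks.push({ type: "text", text });
  }
  return blocks;
}

// Example: a caption plus two photos (placeholder data, not real base64).
const content = buildUserContent("Which of these is the newer model?", [
  { mediaType: "image/jpeg", base64: "<base64 data>" },
  { mediaType: "image/jpeg", base64: "<base64 data>" },
]);
console.log(content.map((block) => block.type));
```

Placing the image blocks before the text is a design choice in this sketch; the source does not say which order the agent runner uses.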

After the skill has been applied, rebuild the container image and restart the service. From that point on, any image sent in a registered chat is processed automatically, with no configuration needed.

What the agent can do with images

Claude’s vision capabilities apply fully. The agent can:

Describe what’s in a photo
Read text from screenshots or documents
Identify objects, people’s expressions, or scenes
Compare multiple images sent in sequence
Answer questions about visual content

The agent doesn’t need to be told “look at this image” — it receives the image as part of the message and responds accordingly.

Image processing

Images are resized before being sent to Claude to balance quality against token usage. The sharp library handles format conversion and downscaling. The processed images are saved to the group’s workspace directory in case the agent needs to reference them later.
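To make the quality-versus-tokens trade-off concrete, here is a small sketch of the arithmetic behind downscaling. The 1024-pixel limit is an assumed value, not one taken from the skill, and the token figure uses the rough width × height / 750 approximation from Anthropic’s vision guidance, so treat it as an estimate only. With sharp, a similar result would typically come from its resize option fit: "inside" combined with withoutEnlargement: true.

```typescript
interface Size {
  width: number;
  height: number;
}

// Scale a size down so its longer edge is at most maxEdge, keeping the
// aspect ratio and never enlarging a smaller image.
function fitWithin(size: Size, maxEdge: number): Size {
  const scale = Math.min(1, maxEdge / Math.max(size.width, size.height));
  return {
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  };
}

// Rough token estimate for an image of the given size. This is an
// approximation (width * height / 750), not an exact billing figure.
function estimateImageTokens(size: Size): number {
  return Math.ceil((size.width * size.height) / 750);
}

// Assumed limit for illustration; the skill's real setting may differ.
const MAX_EDGE = 1024;

const resized = fitWithin({ width: 4000, height: 3000 }, MAX_EDGE);
console.log(resized, estimateImageTokens(resized));
```

Under these assumptions a 4000 × 3000 photo becomes 1024 × 768, which comes to roughly 1,049 tokens by this estimate.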

The processing pipeline handles common WhatsApp image formats including JPEG, PNG, and WebP. If an image can’t be processed (corrupt file, unsupported format), the agent receives a note that an image was attached but couldn’t be read, rather than silently dropping it.
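The fallback behavior can be sketched like this. Everything named here (sniffImageType, prepareAttachment, FAILURE_NOTE) is hypothetical, the wording of the note is an assumption modeled on the messages listed under Troubleshooting, and the resize parameter is a synchronous stand-in for the sharp step, which is asynchronous in practice. Format detection uses the standard leading-byte signatures for JPEG, PNG, and WebP.

```typescript
type ImageMediaType = "image/jpeg" | "image/png" | "image/webp";

// True if `bytes` contains `signature` starting at `offset`.
function hasBytesAt(bytes: Uint8Array, offset: number, signature: number[]): boolean {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((value, i) => bytes[offset + i] === value);
}

// Identify JPEG, PNG, or WebP from the leading bytes of the file.
function sniffImageType(bytes: Uint8Array): ImageMediaType | null {
  if (hasBytesAt(bytes, 0, [0xff, 0xd8, 0xff])) return "image/jpeg";
  if (hasBytesAt(bytes, 0, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  // WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
  if (hasBytesAt(bytes, 0, [0x52, 0x49, 0x46, 0x46]) && hasBytesAt(bytes, 8, [0x57, 0x45, 0x42, 0x50])) {
    return "image/webp";
  }
  return null;
}

// Assumed wording; the skill's actual note text may differ.
const FAILURE_NOTE = "[Image - processing failed]";

type AttachmentResult =
  | { kind: "image"; mediaType: ImageMediaType; bytes: Uint8Array }
  | { kind: "note"; text: string };

// Return the processed image, or a text note if the format is not
// recognized or the resize step throws. The attachment is never dropped.
function prepareAttachment(
  bytes: Uint8Array,
  resize: (input: Uint8Array) => Uint8Array,
): AttachmentResult {
  const mediaType = sniffImageType(bytes);
  if (mediaType === null) return { kind: "note", text: FAILURE_NOTE };
  try {
    return { kind: "image", mediaType, bytes: resize(bytes) };
  } catch {
    return { kind: "note", text: FAILURE_NOTE };
  }
}
```

Returning a tagged result rather than throwing lets the caller always add something to the message, either an image block or the note.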

Troubleshooting

Agent doesn’t mention the image. Check container logs for “Loaded image” messages. If missing, the agent-runner source may not have been synced to existing group caches. Copy the updated source files to each group’s agent-runner-src/ directory and restart.

“Image - download failed.” The WhatsApp media download timed out or failed. This usually happens on unstable connections. Resending the image typically works.

“Image - processing failed.” Sharp may not be installed correctly. Run npm ls sharp to verify it’s present. Sharp uses native bindings, so a clean npm install may be needed if you’ve changed Node.js versions.

Tips

Image vision currently works with WhatsApp only. Other channels would need their own image-download logic.
Sending very large images (10+ MB) works but takes longer to download and process. WhatsApp compresses most photos before sending, so this is rarely an issue in practice.
The agent can handle multiple images in a single message. Each one is processed and passed as a separate content block.