GPT-4 Vision: Best Image Size and Format for Accuracy
Bigger isn't better for AI vision. Here's the resolution and format sweet spot for GPT-4 Vision, Claude 3.5 Sonnet, and Gemini, plus why downscaling before upload often improves accuracy.
A surprising number of teams send 8 MB phone screenshots to GPT-4 Vision and wonder why responses are slow. The truth is that every major vision model resizes your image internally before reasoning over it — sending the original costs upload time, bandwidth, and sometimes accuracy.
The internal resize, summarized
- GPT-4 Vision: tiles images into 512×512 patches; high-detail mode caps at ~2048×2048
- Claude 3.5 Sonnet: long side capped around 1568 pixels
- Gemini 1.5/2.0: images are billed in fixed 258-token units (a small image is one unit; larger images are tiled into more), so resolution beyond a threshold buys nothing
The practical rule: aim for ~1024 pixels on the long side and under 1 MB. Above that you're paying for pixels the model will throw away.
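As a minimal Pillow sketch of that rule (the function name, file names, and the 1024-pixel threshold are this article's rule of thumb, not anything the APIs require):

```python
from PIL import Image

def downscale_for_vision(img: Image.Image, max_side: int = 1024) -> Image.Image:
    """Shrink so the long side is at most max_side; never upscale."""
    scale = max_side / max(img.size)
    if scale >= 1.0:
        return img  # already small enough
    new_size = (round(img.width * scale), round(img.height * scale))
    # LANCZOS is a high-quality downsampling filter; see "The accuracy paradox" below.
    return img.resize(new_size, Image.LANCZOS)

img = downscale_for_vision(Image.open("screenshot.png"))
img.save("screenshot_1024.png")
```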
Format choice, by content type
Text-heavy screenshots → PNG
If the image is a screenshot of code, a contract, an error message, or an article, use PNG. Lossless compression preserves the crisp anti-aliased edges that make characters readable. Better still, run it through Screenshot to Text first and send the extracted text — it uses fewer tokens and is more accurate for dense type.
Photos and natural scenes → JPEG
For real-world photos, JPEG at quality 80 is virtually indistinguishable from the original to a vision model and a fraction of the size.
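Both format choices in one short Pillow sketch (the file names are placeholders, and quality 80 is a sensible default rather than a magic number):

```python
from PIL import Image

# Text-heavy screenshot: PNG is lossless, so anti-aliased character edges stay crisp.
screenshot = Image.open("error_message.png")
screenshot.save("error_message_out.png", format="PNG", optimize=True)

# Natural photo: JPEG at quality 80 looks the same to a vision model
# at a fraction of the bytes. Convert to RGB first, since JPEG has no alpha.
photo = Image.open("street_scene.png")
photo.convert("RGB").save("street_scene.jpg", format="JPEG", quality=80)
```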
Motion or UI animation → GIF
Vision models accept GIFs but reject video files. To show a model an animation, a UI flow, or a moment from a video, convert with Video to GIF for AI Vision. Aim for under 1 MB total — the tool's target-size mode will iteratively shrink dimensions, frame rate, and palette to hit your budget.
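The tool's internals aren't public, but the underlying technique is straightforward to sketch: re-encode at progressively smaller scales and palettes until the file fits the byte budget. A rough Pillow version with illustrative parameters (a fuller version would also drop frames to lower the frame rate):

```python
from io import BytesIO
from PIL import Image, ImageSequence

def shrink_gif(path: str, max_bytes: int = 1_000_000) -> bytes:
    """Re-encode a GIF at shrinking scale and palette until it fits max_bytes."""
    src = Image.open(path)
    frames = [f.copy() for f in ImageSequence.Iterator(src)]
    duration = src.info.get("duration", 100)  # ms per frame
    buf = BytesIO()
    # Each attempt trades fidelity for size; a real tool would also drop
    # frames (e.g. keep every other frame and double the duration).
    for scale, colors in [(1.0, 256), (0.75, 128), (0.5, 64), (0.35, 32)]:
        size = (max(1, round(src.width * scale)), max(1, round(src.height * scale)))
        out = [f.convert("RGB").resize(size, Image.LANCZOS).quantize(colors=colors)
               for f in frames]
        buf = BytesIO()
        out[0].save(buf, format="GIF", save_all=True, append_images=out[1:],
                    duration=duration, loop=0)
        if buf.tell() <= max_bytes:
            break
    return buf.getvalue()
```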
The accuracy paradox
Counter-intuitively, downscaling can improve answers. A massive image gets aggressively resized inside the model, and you have no control over the resampling it uses; fine details like small text can blur or alias in that step. Pre-resizing to the model's native target yourself with a clean Lanczos filter keeps that step under your control.
When to skip vision entirely
For dense text content, OCR plus a text prompt is almost always:
- Cheaper (no per-image token cost)
- Faster (no image upload)
- More accurate on long passages
- Compatible with every model, including text-only ones
Use Screenshot to Text for one-off screenshots, or Batch Document Extractor for whole folders.
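To put numbers on "cheaper": OpenAI published a per-image token formula for GPT-4 Vision's high-detail mode (scale to fit within 2048×2048, scale the shortest side down to 768, then 170 tokens per 512×512 tile plus a flat 85). Here is a small calculator based on that formula; check the current pricing docs before relying on it, as these constants can change:

```python
import math

def gpt4v_image_tokens(width: int, height: int, detail: str = "high") -> int:
    if detail == "low":
        return 85  # low-detail mode is a flat 85 tokens
    # 1. Fit within 2048 x 2048 (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # 2. Scale so the shortest side is at most 768.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # 3. 170 tokens per 512 x 512 tile, plus 85 base tokens.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

print(gpt4v_image_tokens(1920, 1080))  # 1105 tokens for a 1080p screenshot
```

For comparison, the extracted text of a typical screenshot is often only a few hundred tokens, works with every model including text-only ones, and skips the image upload entirely.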
Frequently asked
Does GPT-4 Vision read text from images?
Yes, and generally well. But for dense text (code, contracts, screenshots of articles), dedicated OCR is faster, cheaper, and often more accurate.
What resolution does Claude prefer?
Claude resizes images so the long side is at most ~1568 pixels. Sending bigger files just means slower uploads with no accuracy gain.
Why does PNG work better for text?
PNG is lossless. JPEG compression artefacts smudge anti-aliased text and confuse OCR-style reading.
Keep reading
How to Prepare PDFs for ChatGPT, Claude, and Gemini
A practical guide to extracting clean, AI-ready text from PDFs — born-digital and scanned — so ChatGPT, Claude, and Gemini answer accurately and don't waste tokens on headers, footers, and page numbers.
The Best Way to Feed Long Documents to Claude (and Other Long-Context Models)
Claude's 200K-token context is generous, but you'll still want to clean, compress, and structure long documents before sending them. Here's a step-by-step playbook.
Cleaning Whisper Transcripts for AI Summaries
OpenAI Whisper, Otter, and YouTube transcripts are full of timestamps, filler words, and speaker noise. Here's how to strip them before sending to ChatGPT or Claude — and why it matters.