Vision · 6 min read

GPT-4 Vision: Best Image Size and Format for Accuracy

Bigger isn't better for AI vision. Here's the resolution and format sweet spot for GPT-4 Vision, Claude 3.5 Sonnet, and Gemini, plus why downscaling before upload often improves accuracy.

A surprising number of teams send 8 MB phone screenshots to GPT-4 Vision and wonder why responses are slow. The truth is that every major vision model resizes your image internally before reasoning over it — sending the original costs upload time, bandwidth, and sometimes accuracy.

The internal resize, summarized

  • GPT-4 Vision: tiles images into 512×512 patches; high-detail mode caps at ~2048×2048
  • Claude 3.5 Sonnet: long side capped around 1568 pixels
  • Gemini 1.5/2.0: each image becomes 258 tokens regardless of size, so resolution beyond a threshold buys nothing
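Those tile rules translate directly into token cost. As a sketch, here is OpenAI's published high-detail calculation for GPT-4 Vision (fit within 2048×2048, scale the short side to 768, then charge 85 base tokens plus 170 per 512×512 tile):

```python
import math

def gpt4v_image_tokens(width: int, height: int) -> int:
    """Token cost of an image in GPT-4 Vision high-detail mode."""
    # 1) Scale down (never up) to fit within a 2048x2048 square
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # 2) Scale down so the shortest side is at most 768 pixels
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # 3) Count 512x512 tiles: 170 tokens each, plus 85 base tokens
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```

A 1024×1024 image scales to 768×768, covers four tiles, and costs 765 tokens; a 2048×4096 image costs 1105. Anything beyond those internal targets is pixels you paid to upload but the model never sees.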

The practical rule: aim for ~1024 pixels on the long side and under 1 MB. Above that you're paying for pixels the model will throw away.
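Computing that target is simple arithmetic. A minimal helper that scales the long side down to 1024 while preserving aspect ratio, and never upscales:

```python
def target_size(width: int, height: int, long_side: int = 1024) -> tuple[int, int]:
    """Dimensions with the long side capped at `long_side`, aspect ratio kept."""
    longest = max(width, height)
    if longest <= long_side:
        return width, height  # already small enough; never upscale
    scale = long_side / longest
    return round(width * scale), round(height * scale)
```

A 4032×3024 phone photo becomes 1024×768 — roughly 94% fewer pixels to upload, with no resolution the model would actually have used thrown away.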

Format choice, by content type

Text-heavy screenshots → PNG

If the image is a screenshot of code, a contract, an error message, or an article, use PNG. Lossless compression preserves the crisp anti-aliased edges that make characters readable. Better still, run it through Screenshot to Text first and send the extracted text — it uses fewer tokens and is more accurate for dense type.

Photos and natural scenes → JPEG

For real-world photos, JPEG at quality 80 is virtually indistinguishable from the original to a vision model and a fraction of the size.

Motion or UI animation → GIF

Vision models accept GIFs but reject video files. To show a model an animation, a UI flow, or a moment from a video, convert with Video to GIF for AI Vision. Aim for under 1 MB total — the tool's target-size mode will iteratively shrink dimensions, frame rate, and palette to hit your budget.
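A target-size loop like that can be sketched in a few lines. The size model below (roughly one byte per pixel per frame for a 256-colour palette, ignoring LZW compression) is a loose assumption for illustration, not a real GIF encoder — it just shows the shrink order: frame rate first, then dimensions:

```python
def fit_gif_budget(width: int, height: int, fps: int, seconds: float,
                   budget_bytes: int = 1_000_000) -> tuple[int, int, int]:
    """Shrink frame rate, then dimensions, until the size estimate fits."""
    while True:
        frames = max(1, int(fps * seconds))
        # Rough upper bound: 1 byte/pixel/frame for an 8-bit palette
        estimate = width * height * frames
        if estimate <= budget_bytes or (width <= 160 and fps <= 5):
            return width, height, fps
        if fps > 5:
            fps = max(5, fps // 2)       # drop frame rate first
        else:
            width = int(width * 0.8)     # then shrink dimensions
            height = int(height * 0.8)
```

For an 800×600 clip at 30 fps for 4 seconds, the loop halves the frame rate to 5 fps and shrinks the frame to roughly 208×156 before the estimate fits under 1 MB — the same trade-offs a real target-size encoder makes.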

The accuracy paradox

Counter-intuitively, downscaling can improve answers. A massive image gets aggressively resized inside the model, and that automatic resize can alias or blur fine details like small text. Pre-resizing to the model's native target with a clean Lanczos filter keeps that step under your control.
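A minimal pre-resize sketch using Pillow (assumed installed via `pip install Pillow`): downscale with a Lanczos filter, then re-encode as JPEG at quality 80:

```python
from io import BytesIO

from PIL import Image


def prepare_for_vision(img: Image.Image, long_side: int = 1024,
                       quality: int = 80) -> bytes:
    """Lanczos-downscale to `long_side` and re-encode as JPEG bytes."""
    img = img.convert("RGB")  # JPEG has no alpha channel
    # thumbnail() only ever downscales and preserves aspect ratio
    img.thumbnail((long_side, long_side), Image.LANCZOS)
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```

For text-heavy screenshots, swap the save to `format="PNG"` so the crisp edges stay lossless.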

When to skip vision entirely

For dense text content, OCR plus a text prompt is almost always:

  • Cheaper (no per-image token cost)
  • Faster (no image upload)
  • More accurate on long passages
  • Compatible with every model, including text-only ones

Use Screenshot to Text for one-off screenshots, or Batch Document Extractor for whole folders.

Frequently asked

Does GPT-4 Vision read text from images?

Yes, and quite well. But for dense text (code, contracts, screenshots of articles), dedicated OCR is faster, cheaper, and often more accurate.

What resolution does Claude prefer?

Claude resizes images so the long side is at most ~1568 pixels. Sending bigger files just means slower uploads with no accuracy gain.

Why does PNG work better for text?

PNG is lossless. JPEG compression artefacts smudge anti-aliased text and confuse OCR-style reading.