Computer Use

Computer Use

The model sees a screenshot of a screen, decides where to click or what to type, takes the action, sees the updated screenshot, and repeats. First shipped with Claude 3.5 Sonnet on October 22, 2024. claude-opus-4-6 reaches 72.7% on OSWorld and claude-sonnet-4-6 reaches 72.5% — both near or above the human baseline.

How It Works

The agent has three core tools:

  • Screenshot: Capture the current screen state
  • Mouse: Move cursor, click, drag
  • Keyboard: Type text, send key combinations

The loop is: observe → reason → act → observe. No special API access to applications — the model interacts with GUIs the same way a human would. This means it works with any application that has a graphical interface, without integration work.

OSWorld Benchmark

OSWorld tests agents on real computer tasks: filling forms, navigating file systems, using productivity apps, configuring settings. Tasks are scored by whether the end state is correct, not just whether the steps looked right.

The models are at or above parity with humans on average, though task-type performance varies significantly.

Practical Reliability

The benchmark score is an average. Real-world computer use tasks have high variance:

  • Reliable: Web navigation, form filling, clicking known UI elements, file operations
  • Unreliable: Precise pixel-level interactions, rapidly changing UIs, CAPTCHA, anything requiring understanding very small text

Do not deploy computer use in fully unsupervised contexts on tasks with irreversible consequences (deleting files, making purchases, submitting forms) without explicit confirmation steps in the agent loop.

Use Cases

  • Browser automation without Selenium/Playwright setup (though those are more reliable for known workflows)
  • Testing UI that doesn't have a programmatic API
  • Automating legacy applications with no API access
  • Research tasks across multiple web sources
  • GUI-based workflows in overnight-runs sessions

Access

Via Anthropic API with the computer_20251124 tool type (beta header: computer-use-2025-11-24). An earlier version used computer_20250124 (January 2025 update). Requires running a display (Xvfb in headless Linux, or real display). exe-dev instances work well since they're Linux VMs.

tools = [{"type": "computer_20251124", "name": "computer", "display_width_px": 1024, "display_height_px": 768}]

Weaknesses

  • Token-intensive: each screenshot is ~1000+ tokens
  • Slow: round-trip for screenshot → reasoning → action adds up
  • Brittle on pixel-precise tasks
  • Expensive for long automation sessions

Related

claude-sonnet-4-6 · exe-dev · agentic-workflows · overnight-runs · evals

Sources