Computer Use

2 min · concept, anthropic

Computer Use

The model sees a screenshot of a screen, decides where to click or what to type, takes the action, sees the updated screenshot, and repeats. First shipped with Claude 3.5 Sonnet on October 22, 2024. claude-opus-4-6 reaches 72.7% on OSWorld and claude-sonnet-4-6 reaches 72.5% — both near or above the human baseline.

How It Works

The agent has three core tools:

Screenshot: Capture the current screen state
Mouse: Move cursor, click, drag
Keyboard: Type text, send key combinations

The loop is: observe → reason → act → observe. No special API access to applications — the model interacts with GUIs the same way a human would. This means it works with any application that has a graphical interface, without integration work.

OSWorld Benchmark

OSWorld tests agents on real computer tasks: filling forms, navigating file systems, using productivity apps, configuring settings. Tasks are scored by whether the end state is correct, not just whether the steps looked right.

claude-opus-4-6: 72.7%
claude-sonnet-4-6: 72.5%
Human baseline: 72.36%

The models are at or above parity with humans on average, though task-type performance varies significantly.

Practical Reliability

The benchmark score is an average. Real-world computer use tasks have high variance:

Reliable: Web navigation, form filling, clicking known UI elements, file operations
Unreliable: Precise pixel-level interactions, rapidly changing UIs, CAPTCHA, anything requiring understanding very small text

Do not deploy computer use in fully unsupervised contexts on tasks with irreversible consequences (deleting files, making purchases, submitting forms) without explicit confirmation steps in the agent loop.

Use Cases

Browser automation without Selenium/Playwright setup (though those are more reliable for known workflows)
Testing UI that doesn't have a programmatic API
Automating legacy applications with no API access
Research tasks across multiple web sources
GUI-based workflows in overnight-runs sessions

Access

Via Anthropic API with the computer_20251124 tool type (beta header: computer-use-2025-11-24). An earlier version used computer_20250124 (January 2025 update). Requires running a display (Xvfb in headless Linux, or real display). exe-dev instances work well since they're Linux VMs.

tools = [{"type": "computer_20251124", "name": "computer", "display_width_px": 1024, "display_height_px": 768}]

Weaknesses

Token-intensive: each screenshot is ~1000+ tokens
Slow: round-trip for screenshot → reasoning → action adds up
Brittle on pixel-precise tasks
Expensive for long automation sessions

claude-sonnet-4-6 · exe-dev · agentic-workflows · overnight-runs · evals

Sources

linked from

Assistive Technology Evals (Benchmarks)HCI in the Age of AI Claude 3.5 Sonnet Claude Haiku 4.5 Claude Opus 4.8 Claude Sonnet 4.6 Claude Sonnet 5 Gemini 2.5 Pro GPT-5.4 Perplexity Comet

Computer Use

Computer Use

How It Works

OSWorld Benchmark

Practical Reliability

Use Cases

Access

Weaknesses

Related

Sources