Computer Use
Computer Use
The model sees a screenshot of a screen, decides where to click or what to type, takes the action, sees the updated screenshot, and repeats. First shipped with Claude 3.5 Sonnet on October 22, 2024. claude-opus-4-6 reaches 72.7% on OSWorld and claude-sonnet-4-6 reaches 72.5% — both near or above the human baseline.
How It Works
The agent has three core tools:
- Screenshot: Capture the current screen state
- Mouse: Move cursor, click, drag
- Keyboard: Type text, send key combinations
The loop is: observe → reason → act → observe. No special API access to applications — the model interacts with GUIs the same way a human would. This means it works with any application that has a graphical interface, without integration work.
OSWorld Benchmark
OSWorld tests agents on real computer tasks: filling forms, navigating file systems, using productivity apps, configuring settings. Tasks are scored by whether the end state is correct, not just whether the steps looked right.
- claude-opus-4-6: 72.7%
- claude-sonnet-4-6: 72.5%
- Human baseline: 72.36%
The models are at or above parity with humans on average, though task-type performance varies significantly.
Practical Reliability
The benchmark score is an average. Real-world computer use tasks have high variance:
- Reliable: Web navigation, form filling, clicking known UI elements, file operations
- Unreliable: Precise pixel-level interactions, rapidly changing UIs, CAPTCHA, anything requiring understanding very small text
Do not deploy computer use in fully unsupervised contexts on tasks with irreversible consequences (deleting files, making purchases, submitting forms) without explicit confirmation steps in the agent loop.
Use Cases
- Browser automation without Selenium/Playwright setup (though those are more reliable for known workflows)
- Testing UI that doesn't have a programmatic API
- Automating legacy applications with no API access
- Research tasks across multiple web sources
- GUI-based workflows in overnight-runs sessions
Access
Via Anthropic API with the computer_20251124 tool type (beta header: computer-use-2025-11-24). An earlier version used computer_20250124 (January 2025 update). Requires running a display (Xvfb in headless Linux, or real display). exe-dev instances work well since they're Linux VMs.
tools = [{"type": "computer_20251124", "name": "computer", "display_width_px": 1024, "display_height_px": 768}]
Weaknesses
- Token-intensive: each screenshot is ~1000+ tokens
- Slow: round-trip for screenshot → reasoning → action adds up
- Brittle on pixel-precise tasks
- Expensive for long automation sessions
Related
claude-sonnet-4-6 · exe-dev · agentic-workflows · overnight-runs · evals