Claude 3.5 Sonnet
Anthropic's mid-2024 workhorse. First released June 20, 2024; a substantially improved version shipped October 22, 2024 under the same model name. The October update added computer-use in public beta and pushed SWE-bench Verified from 33.4% to 49.0%, making it the highest-scoring publicly available model on that benchmark at the time — beating specialized agentic coding systems and OpenAI o1-preview. Now superseded by the Claude 4.x family but still widely deployed in enterprise integrations.
Context Window
200K tokens.
Benchmarks
| Benchmark | Score |
|---|---|
| SWE-bench Verified (June release) | 33.4% |
| SWE-bench Verified (October update) | 49.0% |
| GPQA (graduate reasoning) | top-tier at launch |
| HumanEval | 92.0% |
| TAU-bench retail (tool use) | 69.2% |
| TAU-bench airline (tool use) | 46.0% |
| OSWorld screenshot-only | 14.9% |

At the June launch, it set new industry benchmarks on GPQA (graduate-level reasoning), MMLU (undergraduate knowledge), and HumanEval (coding). The October update lifted SWE-bench Verified by 15.6 percentage points (33.4% to 49.0%) and introduced computer use, where Claude scored 14.9% on the screenshot-only OSWorld category, ahead of all other models (next-best was 7.8%).
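
The gains quoted on this page are differences in percentage points, not relative improvements. A minimal sketch of that arithmetic follows, using only figures stated in this note; the names `scores`, `point_gain`, and `osworld_ratio` are illustrative and not part of any benchmark tooling.

```python
# (before, after) scores in percent, copied from this page.
scores = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench retail": (62.6, 69.2),
    "TAU-bench airline": (36.0, 46.0),
}


def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


gains = {name: point_gain(before, after) for name, (before, after) in scores.items()}

# OSWorld screenshot-only: 14.9% versus the next-best 7.8%.
osworld_ratio = round(14.9 / 7.8, 2)

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
print(f"OSWorld lead over next-best: {osworld_ratio}x")
```

Rounding to one decimal keeps floating-point noise out of the reported deltas.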
Pricing
- Input: $3 per million tokens
- Output: $15 per million tokens
Both the June and October versions ship at the same price and speed.
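
As a rough illustration of what these rates mean per request, the sketch below estimates cost from token counts. The function name `estimate_cost_usd` and the example token counts are assumptions made for illustration; only the $3 and $15 per-million rates come from this page, and actual billing may include factors not covered here.

```python
INPUT_USD_PER_MTOK = 3.0    # input rate from this page
OUTPUT_USD_PER_MTOK = 15.0  # output rate from this page


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from token counts at the listed rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    cost = (
        input_tokens * INPUT_USD_PER_MTOK / 1_000_000
        + output_tokens * OUTPUT_USD_PER_MTOK / 1_000_000
    )
    return round(cost, 6)


# Hypothetical request: a 150K-token prompt with a 2K-token reply.
print(estimate_cost_usd(150_000, 2_000))
```

Because output tokens cost five times as much as input tokens, long generations dominate the bill faster than long prompts do.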
October 2024 Update
Key changes in the October 22, 2024 revision:
- SWE-bench Verified: 33.4% → 49.0%, surpassing all publicly available models including reasoning models
- Computer use (public beta): First frontier model to offer computer use via API. Claude sees a screenshot, moves a cursor, clicks, and types, controlled via the `computer_use` tool
- Agentic tool use: TAU-bench retail improved from 62.6% to 69.2%; TAU-bench airline from 36.0% to 46.0%
- Shipped simultaneously with Claude 3.5 Haiku
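
The screenshot, move, click, type cycle described in the computer-use bullet above can be pictured as an observe-act loop. The sketch below is a local simulation only: `Action`, `FakeDesktop`, `run_loop`, and `scripted_policy` are hypothetical names invented for illustration, no API call is made, and the real tool schema is not described in this note.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    """Hypothetical action record: 'move', 'click', 'type', or 'done'."""
    kind: str
    xy: Tuple[int, int] = (0, 0)
    text: str = ""


@dataclass
class FakeDesktop:
    """Stand-in for a real screen: tracks cursor position, clicks, and typed text."""
    cursor: Tuple[int, int] = (0, 0)
    clicks: List[Tuple[int, int]] = field(default_factory=list)
    typed: List[str] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real loop would capture image bytes; a text summary is enough here.
        return f"cursor={self.cursor} typed={''.join(self.typed)!r}"

    def apply(self, action: Action) -> None:
        if action.kind == "move":
            self.cursor = action.xy
        elif action.kind == "click":
            self.clicks.append(self.cursor)
        elif action.kind == "type":
            self.typed.append(action.text)


def run_loop(desktop: FakeDesktop,
             policy: Callable[[str, int], Action],
             max_steps: int = 10) -> int:
    """Observe, ask the policy for an action, apply it; return steps executed."""
    for step in range(max_steps):
        action = policy(desktop.screenshot(), step)
        if action.kind == "done":
            return step
        desktop.apply(action)
    return max_steps


def scripted_policy(observation: str, step: int) -> Action:
    """Stands in for the model: replays a fixed script instead of reasoning."""
    script = [Action("move", (120, 40)), Action("click"), Action("type", text="hello")]
    return script[step] if step < len(script) else Action("done")


desktop = FakeDesktop()
steps = run_loop(desktop, scripted_policy)
print(steps, desktop.screenshot())
```

The `max_steps` cap is a design choice worth keeping in any real loop, since a low OSWorld success rate implies many runs will not terminate on their own.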
Strengths
- Agentic coding: highest SWE-bench score at its launch window; reliable at multi-file edits and test-driven debugging
- Computer use: first public-beta computer use from a frontier model; meaningful OSWorld lead over contemporaries
- Cost efficiency: $3/$15 pricing is competitive for the capability tier
- Context: 200K window handles large codebases and documents
Weaknesses
- Computer use is limited in real-world settings: the 14.9% OSWorld score makes it useful for specific automation tasks but not for robust general GUI control
- Superseded on SWE-bench and OSWorld by claude-sonnet-4-6 and claude-opus-4-6
- No native reasoning/thinking tokens (added in claude-3-7-sonnet)
Position in the Lineup
The October 2024 Claude 3.5 Sonnet held the SWE-bench top spot for roughly four months, until Claude 3.7 Sonnet was released. It is still used in production enterprise deployments because of its Bedrock and Vertex availability and stable API versioning. For new projects, the Claude 4.x family is preferred.
Related
claude-3-7-sonnet · claude-sonnet-4-6 · computer-use · extended-thinking · evals