Claude 3.5 Sonnet

2 min · model, anthropic, agentic

Anthropic's mid-2024 workhorse. First released June 20, 2024; a substantially improved version shipped October 22, 2024 under the same model name. The October update added computer use in public beta and raised SWE-bench Verified from 33.4% to 49.0%. At the time, that was the highest score of any publicly available model on the benchmark, ahead of specialized agentic coding systems and OpenAI o1-preview. Now superseded by the Claude 4.x family but still widely deployed in enterprise integrations.

Context Window

200K tokens.

Benchmarks

Benchmark	Score
SWE-bench Verified (June release)	33.4%
SWE-bench Verified (October update)	49.0%
GPQA (graduate reasoning)	top-tier at launch
HumanEval (June release)	92.0%
TAU-bench retail (tool use, October update)	69.2%
TAU-bench airline (tool use, October update)	46.0%
OSWorld screenshot-only (October update)	14.9%

At its June launch, the model set new industry benchmarks on GPQA (graduate-level reasoning), MMLU (undergraduate-level knowledge), and HumanEval (coding). The October update lifted SWE-bench Verified by 15.6 percentage points (33.4% to 49.0%) and introduced computer use, where Claude scored 14.9% on the screenshot-only OSWorld task, ahead of every other model evaluated (next-best was 7.8%).
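The gains quoted above are in percentage points, which is easy to confuse with relative improvement. A minimal sketch of the arithmetic, using only the scores listed in this note; the function and variable names are illustrative and not taken from any source:

```python
# Scores quoted in this note, in percent.
SWE_BENCH_JUNE = 33.4
SWE_BENCH_OCTOBER = 49.0
OSWORLD_CLAUDE = 14.9
OSWORLD_NEXT_BEST = 7.8


def point_gain(before: float, after: float) -> float:
    """Absolute change, in percentage points."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Change relative to the starting score, in percent."""
    return round((after - before) / before * 100, 1)


swe_points = point_gain(SWE_BENCH_JUNE, SWE_BENCH_OCTOBER)
swe_relative = relative_gain(SWE_BENCH_JUNE, SWE_BENCH_OCTOBER)
osworld_ratio = round(OSWORLD_CLAUDE / OSWORLD_NEXT_BEST, 2)

print(f"SWE-bench Verified: +{swe_points} points ({swe_relative}% relative)")
print(f"OSWorld: {osworld_ratio}x the next-best score")
```

Read this way, the SWE-bench Verified jump is 15.6 points, or roughly a 47% relative improvement, and the OSWorld score is a little under twice the next-best result.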

Pricing

Input: $3 per million tokens
Output: $15 per million tokens

Both the June and October versions ship at the same price and speed.
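As a rough illustration of what these rates mean per request, the sketch below multiplies token counts by the listed prices. It assumes plain per-token billing with no discounts of any kind, and the function name is hypothetical rather than part of any SDK:

```python
# List prices from this note, in USD per million tokens.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at list price (no discounts assumed)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    # Round to micro-dollars to hide floating-point noise.
    return round(input_cost + output_cost, 6)


# A prompt that fills the whole 200K window, with a 2,000-token reply.
full_window_cost = estimate_cost_usd(200_000, 2_000)
print(f"${full_window_cost:.2f}")
```

Under these assumptions, a request that fills the entire 200K context and returns a 2,000-token answer costs about $0.63, almost all of it input.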

October 2024 Update

Key changes in the October 22, 2024 revision:

SWE-bench Verified: 33.4% → 49.0%, surpassing all publicly available models including reasoning models
Computer use (public beta): First frontier model to offer computer use via API. Claude sees a screenshot, moves a cursor, clicks, and types, with those actions exposed through the API's computer-use tool
Agentic tool use: TAU-bench retail improved from 62.6% to 69.2%; TAU-bench airline from 36.0% to 46.0%
Shipped simultaneously with Claude 3.5 Haiku
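The screenshot, move, click, type cycle described for computer use can be pictured as an observe-act loop. The sketch below simulates that loop against a fake desktop and makes no API calls. The action names, dictionary shapes, and the `FakeDesktop` class are all hypothetical stand-ins, not the actual tool schema:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FakeDesktop:
    """Stand-in for a real desktop; it only records what the agent did."""
    cursor: tuple = (0, 0)
    clicks: list = field(default_factory=list)
    typed: list = field(default_factory=list)

    def screenshot(self) -> str:
        # A real system would return image data; a text summary is enough here.
        return f"cursor={self.cursor} clicks={len(self.clicks)} typed={len(self.typed)}"


def apply_action(desktop: FakeDesktop, action: dict) -> None:
    """Apply one hypothetical action to the fake desktop."""
    kind = action["kind"]
    if kind == "move":
        desktop.cursor = (action["x"], action["y"])
    elif kind == "click":
        desktop.clicks.append(desktop.cursor)
    elif kind == "type":
        desktop.typed.append(action["text"])
    else:
        raise ValueError(f"unknown action kind: {kind!r}")


Policy = Callable[[str, int], Optional[dict]]


def run_episode(desktop: FakeDesktop, policy: Policy, max_steps: int = 10) -> int:
    """Observe, ask the policy for an action, apply it; stop when it returns None."""
    for step in range(max_steps):
        observation = desktop.screenshot()
        action = policy(observation, step)
        if action is None:
            return step
        apply_action(desktop, action)
    return max_steps


def scripted_policy(observation: str, step: int) -> Optional[dict]:
    """Plays the role of the model: a fixed script instead of real decisions."""
    script = [
        {"kind": "move", "x": 120, "y": 45},
        {"kind": "click"},
        {"kind": "type", "text": "hello"},
    ]
    return script[step] if step < len(script) else None


desktop = FakeDesktop()
steps_taken = run_episode(desktop, scripted_policy)
print(steps_taken, desktop.cursor, desktop.clicks, desktop.typed)
```

In a real integration the scripted policy would be replaced by a model call that receives the screenshot, and the step cap guards against an agent that never decides it is done.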

Strengths

Agentic coding: highest SWE-bench Verified score among publicly available models at the October 2024 release; reliable at multi-file edits and test-driven debugging
Computer use: first public-beta computer use from a frontier model; meaningful OSWorld lead over contemporaries
Cost efficiency: $3/$15 pricing is competitive for the capability tier
Context: 200K window handles large codebases and documents

Weaknesses

Computer use: the 14.9% OSWorld score signals limited real-world reliability; useful for specific automation tasks but not robust general GUI control
Superseded on SWE-bench and OSWorld by claude-sonnet-4-6 and claude-opus-4-6
No native reasoning/thinking tokens (added in claude-3-7-sonnet)

Position in the Lineup

The October 2024 Claude 3.5 Sonnet held the SWE-bench top spot for roughly four months, until Claude 3.7 Sonnet arrived. It is still used in production enterprise deployments because of its availability on Amazon Bedrock and Google Cloud Vertex AI and its stable API versioning. For new projects, the Claude 4.x family is preferred.

Related

claude-3-7-sonnet · claude-sonnet-4-6 · computer-use · extended-thinking · evals
