Claude 3.5 Sonnet

Claude 3.5 Sonnet

Anthropic's mid-2024 workhorse. First released June 20, 2024; a substantially improved version shipped October 22, 2024 under the same model name. The October update added computer-use in public beta and pushed SWE-bench Verified from 33.4% to 49.0%, making it the highest-scoring publicly available model on that benchmark at the time — beating specialized agentic coding systems and OpenAI o1-preview. Now superseded by the Claude 4.x family but still widely deployed in enterprise integrations.

Context Window

200K tokens.

Benchmarks

Benchmark Score
SWE-bench Verified (June release) 33.4%
SWE-bench Verified (October update) 49.0%
GPQA (graduate reasoning) top-tier at launch
HumanEval 92.0%
TAU-bench retail (tool use) 69.2%
TAU-bench airline (tool use) 46.0%
OSWorld screenshot-only 14.9%

At June launch, set industry benchmarks on GPQA (graduate-level reasoning), MMLU (undergraduate knowledge), and HumanEval (coding). The October update lifted SWE-bench by 15+ percentage points and introduced computer use, where Claude outperformed all other models at 14.9% on the screenshot-only OSWorld task (next-best was 7.8%).

Pricing

  • Input: $3 per million tokens
  • Output: $15 per million tokens

Both the June and October versions ship at the same price and speed.

October 2024 Update

Key changes in the October 22, 2024 revision:

  • SWE-bench Verified: 33.4% → 49.0%, surpassing all publicly available models including reasoning models
  • Computer use (public beta): First frontier model to offer computer use via API. Claude sees a screenshot, moves a cursor, clicks, and types — controlled via the computer_use tool
  • Agentic tool use: TAU-bench retail improved from 62.6% to 69.2%; TAU-bench airline from 36.0% to 46.0%
  • Shipped simultaneously with Claude 3.5 Haiku

Strengths

  • Agentic coding: highest SWE-bench score at its launch window; reliable at multi-file edits and test-driven debugging
  • Computer use: first public-beta computer use from a frontier model; meaningful OSWorld lead over contemporaries
  • Cost efficiency: $3/$15 pricing is competitive for the capability tier
  • Context: 200K window handles large codebases and documents

Weaknesses

  • Computer use OSWorld score (14.9%) is real-world limited — useful for specific automation tasks but not robust general GUI control
  • Superseded on SWE-bench and OSWorld by claude-sonnet-4-6 and claude-opus-4-6
  • No native reasoning/thinking tokens (added in claude-3-7-sonnet)

Position in the Lineup

The October 2024 Claude 3.5 Sonnet held the SWE-bench top spot for roughly four months before Claude 3.7 Sonnet. Still used in production enterprise deployments due to Bedrock and Vertex availability and stable API versioning. For new projects, the Claude 4.x family is preferred.

Related

claude-3-7-sonnet · claude-sonnet-4-6 · computer-use · extended-thinking · evals

Sources