GPT-4o

GPT-4o

OpenAI's first natively omnimodal model — the "o" stands for omni. Released May 13, 2024. Prior GPT-4 voice pipelines stitched together separate speech-to-text, text, and text-to-speech models; GPT-4o processes and generates text, audio, image, and video in a single unified architecture, removing the latency overhead of pipeline handoffs.

Architecture and Capabilities

Unified end-to-end model: a single set of weights handles all modalities. Accepts text, audio, image, and video input; produces text, audio, and image output. This enables features like tone and emotion detection in voice, and reasoning about visual and auditory context simultaneously.

Audio response latency of ~320ms, matching typical human conversational response time (210–320ms). Previous GPT-4 voice pipelines had ~5.4s end-to-end latency.

Context Window

128K tokens.

Benchmarks

Benchmark Score
SWE-bench Verified ~33.2%
HumanEval (coding) 90.2%
MMLU 88.7%

SWE-bench Verified measures real-world GitHub issue resolution. 33.2% was competitive at launch but has since been surpassed by newer models (GPT-4.1 at 54.6%, GPT-5 at 74.9%).

Pricing

Launch pricing (May 2024): $5.00/$15.00 per million input/output tokens. Price cut in October 2024 to $2.50/$10.00 per million tokens — 50% reduction. Current rates remain at $2.50/$10.00.

Variants and Timeline

  • GPT-4o — May 13, 2024, base release
  • GPT-4o mini — July 18, 2024; smaller, cheaper version that replaced GPT-3.5 Turbo as the default ChatGPT model
  • GPT-4o Audio / Realtime Preview — October 2024; dedicated audio API variants (gpt-4o-audio-preview, gpt-4o-realtime-preview) enabling low-latency voice applications

Strengths

  • Multimodal reasoning: handles interleaved text, image, and audio in a single context
  • Voice: real-time conversational speed, emotion detection, response in multiple voices/accents
  • Cost: price-per-token dropped significantly after launch; competitive for high-volume workloads
  • Ecosystem: default model in ChatGPT, widely deployed across API integrations

Weaknesses

  • Coding capability at 33% SWE-bench is substantially behind current frontier; not the right pick for complex agentic coding
  • No native web search (requires tool use)
  • Superseded on most benchmarks by GPT-4.1, o1, and the GPT-5 family

Position in the Lineup

GPT-4o remains widely deployed in production applications but is no longer frontier. Superseded by o1 for reasoning tasks, GPT-4.1 for coding, and the GPT-5 family for general capability. Still a pragmatic choice for multimodal workloads and cost-sensitive API use cases.

Related

o1 · evals

Sources