GPT-4o
GPT-4o
OpenAI's first natively omnimodal model — the "o" stands for omni. Released May 13, 2024. Prior GPT-4 voice pipelines stitched together separate speech-to-text, text, and text-to-speech models; GPT-4o processes and generates text, audio, image, and video in a single unified architecture, removing the latency overhead of pipeline handoffs.
Architecture and Capabilities
Unified end-to-end model: a single set of weights handles all modalities. Accepts text, audio, image, and video input; produces text, audio, and image output. This enables features like tone and emotion detection in voice, and reasoning about visual and auditory context simultaneously.
Audio response latency of ~320ms, matching typical human conversational response time (210–320ms). Previous GPT-4 voice pipelines had ~5.4s end-to-end latency.
Context Window
128K tokens.
Benchmarks
| Benchmark | Score |
|---|---|
| SWE-bench Verified | ~33.2% |
| HumanEval (coding) | 90.2% |
| MMLU | 88.7% |
SWE-bench Verified measures real-world GitHub issue resolution. 33.2% was competitive at launch but has since been surpassed by newer models (GPT-4.1 at 54.6%, GPT-5 at 74.9%).
Pricing
Launch pricing (May 2024): $5.00/$15.00 per million input/output tokens. Price cut in October 2024 to $2.50/$10.00 per million tokens — 50% reduction. Current rates remain at $2.50/$10.00.
Variants and Timeline
- GPT-4o — May 13, 2024, base release
- GPT-4o mini — July 18, 2024; smaller, cheaper version that replaced GPT-3.5 Turbo as the default ChatGPT model
- GPT-4o Audio / Realtime Preview — October 2024; dedicated audio API variants (
gpt-4o-audio-preview,gpt-4o-realtime-preview) enabling low-latency voice applications
Strengths
- Multimodal reasoning: handles interleaved text, image, and audio in a single context
- Voice: real-time conversational speed, emotion detection, response in multiple voices/accents
- Cost: price-per-token dropped significantly after launch; competitive for high-volume workloads
- Ecosystem: default model in ChatGPT, widely deployed across API integrations
Weaknesses
- Coding capability at 33% SWE-bench is substantially behind current frontier; not the right pick for complex agentic coding
- No native web search (requires tool use)
- Superseded on most benchmarks by GPT-4.1, o1, and the GPT-5 family
Position in the Lineup
GPT-4o remains widely deployed in production applications but is no longer frontier. Superseded by o1 for reasoning tasks, GPT-4.1 for coding, and the GPT-5 family for general capability. Still a pragmatic choice for multimodal workloads and cost-sensitive API use cases.