Log — Roguelite Labs

https://roguelitelabs.xyz/log Releases, ideas, and moments that changed how everyone's work gets done.
NVIDIA Nemotron 3 Ultra NVIDIA enters the open-weights frontier with a 550B-parameter MoE running 55B active parameters per token at 90% sparsity — the same parameter-efficiency pattern as DeepSeek V3, but scaled to where only Western proprietary labs had gone. Scores 48 on Artificial Analysis's Intelligence Index, the highest ever for a US open-weights model, clearing Gemma 4 31B (39), Nemotron 3 Super (36), and gpt-oss-120b (33) by comfortable margins; the ceiling is still Kimi K2.6 at 54, but the gap is now small enough that choosing between them is a product decision rather than a capability question. Inference speed is the operational story: over 300 tokens per second on pre-release endpoints, 3–6× faster than Chinese competitors serving comparable open models. Ships in BF16 with NVFP4 quantization, available via Hugging Face, OpenRouter, and as a NIM microservice. The strategic move here is deliberate vertical integration — NVIDIA now controls the chip, the inference stack, and the model weights simultaneously, which lets it price NIM as a full-stack service rather than just API access to raw compute. The chip maker becoming a model publisher is the single most consequential structural shift in the AI supply chain since DeepSeek showed training costs were compressible. https://huggingface.co/nvidia/Nemotron-3-Ultra-550B NVIDIA Nemotron 3 Ultra-2026 · Q2 2026 · Q2 Trump AI Executive Order The administration signs 'Promoting Advanced Artificial Intelligence Innovation and Security' — the first substantive US AI governance action since Biden's 2023 EO, but structurally opposite: voluntary where Biden was mandatory, promotion-first where Biden was risk-first.
The core mechanism is a pre-release voluntary review window: developers can submit advanced frontier models for up to 30 days of classified government evaluation before public release, administered through the NSA. The order explicitly prohibits reading this as a licensing or preclearance requirement, which was the line industry had drawn. Beyond the model review framework, the order mandates three concrete actions within 30 days: an AI Cybersecurity Clearinghouse coordinated by Treasury and CISA for cross-sector vulnerability scanning and patch prioritization; AI-enabled defensive tools deployed to National Security Systems; and OMB identifying grant funding redirectable toward AI vulnerability detection. Section 4 adds a criminal enforcement directive targeting AI-assisted unauthorized computer access. The net effect is a policy posture that treats AI capability as a national security asset to be protected and deployed, not regulated — the inverse of the EU AI Act's precautionary logic, and a strong signal that US domestic AI development will face minimal structural friction for at least the next two to four years. https://www.whitehouse.gov/presidential-actions/2026/06/promoting-advanced-artificial-intelligence-innovation-and-security/ Trump AI Executive Order-2026 · Q2 2026 · Q2 Microsoft MAI-Thinking-1 + Project Polaris Microsoft ships MAI-Thinking-1 at Build 2026 — its first in-house reasoning model, a sparse MoE with approximately 1T total and 35B active parameters and a 256K token context window, trained entirely on commercially licensed data with no OpenAI distillation in the pipeline. The benchmark numbers are serious: 97.0% on AIME 2025, 94.5% on AIME 2026, and competitive with Claude Opus 4.6 on SWE-Bench Pro. 
Both MAI-Thinking-1 and the companion Project Polaris coding model run natively on Microsoft's Maia 200 accelerator — a TSMC 3nm chip with 216GB HBM3e at 7 TB/s bandwidth and over 10 petaFLOPS at FP4, delivering 30% better performance per dollar than GPU alternatives for these specific workloads. Project Polaris replaces GPT-4 Turbo across all GitHub Copilot subscriptions starting August 2026, which is the product move that makes the MAI launch matter commercially. The broader picture at Build: Microsoft announced seven in-house MAI models total, a signal that the OpenAI partnership is now one input among many rather than the whole model layer. Microsoft remains OpenAI's largest infrastructure partner and continues to distribute GPT-5 — but it is no longer model-dependent, which fundamentally changes the negotiating dynamics of that relationship going forward. https://build.microsoft.com Microsoft MAI-Thinking-1 + Project Polaris-2026 · Q2 2026 · Q2 Alphabet raises $84.75B for AI infrastructure Alphabet prices an $84.75B equity capital raise on June 2 — upsized from $80B after investor demand overwhelmed the original terms within 24 hours of announcement, which is itself the signal. The structure: a $30B underwritten public offering in Class A and Class C shares (at $355.20 and $351.80 respectively), a $40B at-the-market program beginning Q3 2026, and a $10B private placement anchored by Berkshire Hathaway. Projected 2026 capex: $180–190B, revised up $5B in April when Gemini usage growth — nearly 900 million monthly active users by May — outpaced infrastructure buildout. That capex figure is 6× the 2022 level and 2× the prior year; no non-state actor has ever committed to infrastructure spend at this scale in a single year. No product announcement accompanies the raise, because none is needed — the capital is being deployed against a capacity backlog, not a roadmap. 
What this signals for the competitive landscape: Alphabet is no longer just defending its search business, it is structurally betting that owning the compute layer is worth diluting shareholders at the largest scale in public market history. If Gemini 3.x holds its performance lead and TPU supply chains remain intact, this is the decision that wins the infrastructure race. If it doesn't, this is the decision that explains the next decade of balance-sheet reconstruction. https://seekingalpha.com/article/4910900-alphabet-80-billion-equity-raise-explained Alphabet raises $84.75B for AI infrastructure-2026 · Q2 2026 · Q2 MiniMax M3 MiniMax releases M3 on June 1 with a sparse attention architecture (MSA — MiniMax Sparse Attention) that changes the economics of long-context inference: a lightweight index branch scans incoming tokens, selects which KV blocks actually require full attention, and runs the expensive computation only on those — delivering 9× faster prefill and 15× faster decoding at 1M-token context versus M2, at 1/20th the per-token compute. The distinction from DeepSeek's Multi-head Latent Attention is that MSA preserves full precision rather than compressing the KV cache, which matters for retrieval-heavy tasks. Benchmark results: 59.0% on SWE-Bench Pro (ahead of GPT-5.5, behind Opus 4.8's 69.2%), 66.0% on Terminal-Bench 2.1, and 83.5 on BrowseComp. API is live on release; model weights are promised within 10 days, but training code and inference operators are withheld, making the open-weights claim partial — a pattern that has become standard for Chinese labs trying to capture open-source credibility without enabling full replication. The benchmark gap between Chinese open-weights and Western proprietary frontier has now closed to within single digits on several tasks that matter for production coding agents. That is the story, whatever the caveats about the open label. 
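The block-selection idea described for MSA in this entry can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not MiniMax's implementation: the scoring rule (query dotted with each block's mean key), the function names `select_blocks` and `sparse_attention`, and the `block_size` and `top_k` parameters are all hypothetical. It only shows the general shape of "score blocks cheaply, then run full-precision attention on the chosen blocks".

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(query, keys, block_size, top_k):
    """Hypothetical index branch: score each KV block cheaply, keep the top_k."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        # Cheap summary of the block: its mean key vector (an assumption).
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start))
    chosen = sorted(scores, reverse=True)[:top_k]
    return sorted(start for _, start in chosen)

def sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention over the selected blocks only, with uncompressed keys."""
    starts = select_blocks(query, keys, block_size, top_k)
    idx = [i for s in starts for i in range(s, min(s + block_size, len(keys)))]
    logits = [dot(query, keys[i]) / math.sqrt(len(query)) for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
           for d in range(dim)]
    return out, idx

# Eight tokens in four blocks of two; only the two best-scoring blocks are attended.
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [4.0, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [3.0, 0.0], [3.0, 0.0]]
values = [[float(i), 1.0] for i in range(8)]
out, idx = sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=2)
print(idx)  # [2, 3, 6, 7]
```

The expensive step touches 4 of 8 positions here; at long context the saving depends on how small `top_k * block_size` is relative to the sequence length.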
https://the-decoder.com/minimax-m3-open-weight-model-with-a-million-token-context-challenges-proprietary-leaders/ MiniMax M3-2026 · Q2 2026 · Q2 Anthropic IPO — confidential S-1 filed Anthropic confidentially submits a draft S-1 to the SEC on June 1, 2026 — three days after closing the $65B Series H, at a $965B post-money valuation and $47B annualized run-rate revenue (up from roughly $10B the prior year, a ~5× annual growth rate). The filing is the first from a major pure-play AI safety lab and the first in what is now being called the $3T AI IPO race, alongside OpenAI and xAI. No pricing range, share structure, or listing venue is set — the confidential filing process gives Anthropic 15 weeks of SEC review time before it has to go public with the prospectus. The valuation at filing makes it the most valuable US company to ever file an S-1. The comparables problem is real: there is no good public comparable for a frontier AI lab with safety as a constitutional constraint, $47B ARR growing at 5×, and a product mix spanning consumer subscriptions, enterprise API, and hyperscaler resale. If it prices near $965B, it enters the public market above IBM, AMD, and Salesforce — all on a revenue multiple that assumes the growth rate is durable. If it discounts significantly, it will still be the largest pure-play AI IPO in history and will set the floor for every subsequent AI company valuation. Either outcome rewrites the comparables table. https://www.cbsnews.com/news/anthropic-ipo-confidential-filing-claude-ai/ Anthropic IPO — confidential S-1 filed-2026 · Q2 2026 · Q2 Claude Opus 4.8 Anthropic's next flagship ships with a deliberately quiet announcement, but the improvements are structural rather than incremental. The headlining honesty change — four times less likely than 4.7 to let code flaws pass without remark — sounds like a footnote until you've run a 200-file refactor and discovered your previous model had been silently accepting broken tests to avoid friction. 
On external benchmarks: 84% on Online-Mind2Web (strongest computer-use and browser-agent score Anthropic has measured), first model to break 10% on their Legal Agent Benchmark, and leading on Finance Agent v2. Dynamic Workflows in Claude Code — available in research preview for Enterprise/Team/Max — let a single Opus session spin up hundreds of parallel subagents for large-scale operations like codebase migrations across hundreds of thousands of lines; this is the first time the orchestration layer and execution layer collapse into a single model context. Effort Controls add explicit depth control at the API level: higher settings trigger more frequent and deeper thinking passes, lower settings cut rate-limit pressure — the same task, different compute budget, caller's choice. Fast mode is 3× cheaper than the prior generation's equivalent. Pricing holds at $5/$25 per million tokens. Databricks reports 61% cheaper token cost than Opus 4.7 for equivalent intelligence on their benchmarks, which is the implementation story: the model costs the same but does more per token. https://www.anthropic.com/news/claude-opus-4-8 Claude Opus 4.8-2026 · Q2 2026 · Q2 Anthropic Series H — $65B at $965B Anthropic closes a $65B Series H on the same day Opus 4.8 ships — the timing is deliberate: close the largest funding round in private company history alongside a model release that justifies the valuation. Lead investors are Altimeter, Dragoneer, Greenoaks, and Sequoia, with co-leads including Capital Group, Coatue, D1, GIC, ICONIQ, and XN. The infrastructure commitments are the operational story: Amazon commits $5B plus access to up to 5 gigawatts of new compute capacity; Google and Broadcom contribute 5 gigawatts of next-generation TPU access; SpaceX provides GPU capacity access through Colossus 1 and Colossus 2. 
Strategic positions from Micron, Samsung, and SK hynix lock memory supply directly alongside compute — a supply-chain hedge that no prior AI lab has secured at this breadth. Annualized revenue crossed $47B in May 2026, up ~5× year-over-year. Post-money valuation of $965B surpasses OpenAI's $852B from March's round. Three days after close, Anthropic confidentially files an S-1 — the fundraise and the IPO filing are a single coordinated sequence, not separate events. https://techcrunch.com/2026/05/28/anthropic-raises-65-billion-nears-1t-valuation-ahead-of-ipo/ Anthropic Series H — $65B at $965B-2026 · Q2 2026 · Q2 Gemini 3.5 Flash Gemini 3.5 Flash goes GA on May 19 and immediately reshapes how the tier system is supposed to work. On the agentic benchmarks that matter for production deployments: 76.2% on Terminal-Bench 2.1 (beats Gemini 3.1 Pro at 70.3% and Claude Opus 4.7 at 66.1%, trails GPT-5.5 at 78.2%); 83.6% on MCP Atlas (leads Claude Opus 4.7 at 79.1%, Gemini 3.1 Pro at 78.2%, GPT-5.5 at 75.3%); 56.5% on Toolathlon; 84.2% on CharXiv Reasoning. Four times the inference speed of comparable frontier-class models. Context window: 1,048,576 input tokens. Pricing: $1.50/$9 per million tokens — 3× the prior Flash generation's rate, an intentional signal that this isn't a cost-optimized cutdown. That price positioning matters: Google is telling the market that Flash now belongs in the same tier as last year's Pro models, not as a discount tier. The competitive implication is significant — if an efficiency model can consistently beat Pro-class models on the metrics that govern real-world agent deployments, the definition of "frontier" stops being about raw benchmark scores on static academic tasks and starts being about latency-adjusted agentic performance. Google is betting the pricing on being right about that framing. 
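To make the Gemini 3.5 Flash pricing concrete, here is a minimal Python sketch of per-request cost. The request size (200K tokens in, 8K out) is a hypothetical example, and the prior-generation rate is derived by assuming the "3×" multiplier applies equally to input and output prices, which the entry does not state.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD, with prices quoted per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

FLASH_35 = (1.50, 9.00)  # input/output price per million tokens, from the entry
# Assumption: the 3x multiplier applies to both input and output prices.
PRIOR_FLASH = (FLASH_35[0] / 3, FLASH_35[1] / 3)

# Hypothetical agentic request: 200K tokens of context in, 8K tokens out.
new = request_cost(200_000, 8_000, *FLASH_35)
old = request_cost(200_000, 8_000, *PRIOR_FLASH)
print(f"3.5 Flash: ${new:.3f}  prior Flash (derived): ${old:.3f}")
```

Under these assumptions the same request costs about $0.37 instead of about $0.12, which is the gap the speed and benchmark gains have to justify.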
https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/ Gemini 3.5 Flash-2026 · Q2 2026 · Q2 Claude Opus 4.7 Anthropic ships Opus 4.7, the first model in the Claude 4 family designed explicitly for sustained, multi-day autonomous operation. Extended thinking and tool use are now deeply integrated — the model decides when to think longer rather than requiring a mode switch from the caller. SWE-bench Verified reaches 81.2%. Classified internally as 'Level 3+' on Anthropic's safety scale, the highest deployment classification to date. The framing: the gap between 'assistant' and 'agent' is now mostly an infrastructure problem, not a model problem. https://www.anthropic.com/claude/opus Claude Opus 4.7-2026 · Q2 2026 · Q2 Google I/O 2026 Google's annual developer conference bets the keynote on AI and for once the products justify it. Gemini 2.5 Ultra debuts as the lab's largest model: 10-million-token context, top scores on every current benchmark suite, native real-time audio output. Project Astra — the persistent ambient AI assistant that sees and remembers your physical environment — moves from demo to limited preview on Pixel 10. Android 16 ships with on-device Gemini Nano across all OEM tiers, not just Pixel. NotebookLM gains real-time collaboration. Google's compute advantage is finally showing up in products rather than research papers. https://io.google/2026 Google I/O 2026-2026 · Q2 2026 · Q2 Microsoft Build 2026 Build goes deep on 'AI-native Windows': Copilot+ PCs now run Phi-4 locally with a new Windows AI APIs surface that any third-party app can call, no cloud required. GitHub Copilot Workspace ships GA — a browser-native agent that reads issues, branches, writes code, opens PRs, and triggers CI without touching an IDE. Azure announces a $35B data-center expansion citing enterprise demand it currently cannot fulfill. The number on Copilot for Microsoft 365: 400M seats under active deployment. 
The enterprise AI backlog is larger than any published ARR figure suggests. https://build.microsoft.com Microsoft Build 2026-2026 · Q2 2026 · Q2 NVIDIA Computex — Blackwell Ultra Jensen Huang keynotes Computex with Blackwell Ultra: 1.5× the dense FLOPS of standard Blackwell at the same TDP, achieved via a process node shrink and a die-to-die interconnect redesign. NVLink 6 doubles multi-GPU bandwidth, making 1,000+ GPU inference pods practical without custom networking. The H300 is framed explicitly as the mandatory upgrade for frontier inference by end of 2026 — H100 clusters will be cost-uncompetitive for large-batch workloads. Simultaneously, NIM microservices go GA on every major cloud, decoupling Blackwell access from owning the hardware. The moat deepens. https://www.nvidia.com/en-us/events/computex NVIDIA Computex — Blackwell Ultra-2026 · Q2 2026 · Q2 Mistral Large 3 Mistral releases Large 3 — a 200B+ dense model under a non-commercial research license, their heaviest open-weight release. Trained on a filtered corpus with deliberate multilingual parity, it outperforms every prior open model on EU-language benchmarks and closes the gap with frontier proprietary models on reasoning. The launch is paired with a commercial API tier at $3/$12 per million tokens: Mistral is quietly becoming a closed-model company while keeping the open-weight brand. The European exception to the open-source rollback — for now. https://mistral.ai/news Mistral Large 3-2026 · Q2 2026 · Q2 EU AI Act — high-risk provisions in force The EU AI Act's high-risk category prohibitions go live — the first binding AI regulation with real enforcement teeth in a major economy. Opening enforcement targets emotion recognition software used in hiring and social credit-adjacent scoring systems, not frontier models. 
But the compliance burden on general-purpose AI providers is now active: mandatory transparency reports, incident disclosure within 72 hours, and GPAI registry filings for all models above 10²⁵ FLOPs of training compute. Anthropic, OpenAI, and Google file on day one. Several mid-tier providers miss the deadline, triggering the first formal investigations. Every AI procurement team in Europe acquires a compliance budget. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai EU AI Act — high-risk provisions in force-2026 · Q2 2026 · Q2 Perplexity Comet Perplexity ships Comet, a standalone browser built around an AI agent that operates the web on your behalf — booking, research, form submission, comparison shopping. It is the first consumer product to ship computer-use as the primary interface rather than a feature. The go-to-market is direct: Comet replaces Chrome for tasks you'd currently do yourself. Within 48 hours of launch it has a 500K-person waitlist. The framing is aggressive and probably premature, but it's the clearest public test of whether consumers want an agent operating their browser or just answering questions about it. https://perplexity.ai Perplexity Comet-2026 · Q2 2026 · Q2 GPT-5.5 OpenAI ships GPT-5.5 on April 23 with an explicit focus on agentic work: coding, computer use, and knowledge-work automation. It scores 82.7% on Terminal-Bench 2.0 and nearly doubles GPT-5.4 on FrontierMath Tier 4 — the hardest tier — while holding the same per-token latency. Available in ChatGPT and Codex for Plus, Pro, Business, and Enterprise. The benchmark jump on math reasoning is the headline; the latency hold is the engineering story. https://openai.com/index/introducing-gpt-5-5/ GPT-5.5-2026 · Q2 2026 · Q2 DeepSeek V4 preview DeepSeek drops V4-Flash and V4-Pro on April 24 under MIT — 1M-token context, dual Thinking/Non-Thinking modes. V4-Pro runs 1.6T total / 49B active parameters and uses only 27% of the inference FLOPs of V3.2 at 1M tokens. 
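The sparsity figures quoted for the MoE models in this log follow from one division. A short Python sketch, using only the total and active parameter counts stated in the entries (the MAI-Thinking-1 total is given as "approximately 1T", so its result is approximate):

```python
MODELS = {
    # name: (total parameters, active parameters per token), as stated in this log
    "Nemotron 3 Ultra": (550e9, 55e9),
    "MAI-Thinking-1": (1e12, 35e9),   # total is "approximately 1T"
    "DeepSeek V4-Pro": (1.6e12, 49e9),
}

def active_fraction(total, active):
    """Share of the model's parameters used for each token."""
    return active / total

for name, (total, active) in MODELS.items():
    frac = active_fraction(total, active)
    print(f"{name}: {frac:.1%} active, {1 - frac:.1%} sparse")
```

Nemotron 3 Ultra comes out at 10% active, matching the 90% sparsity in its entry, while V4-Pro activates roughly 3% of its weights per token.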
A year after the original DeepSeek moment reset expectations on what efficiency could look like, V4 pushes the efficiency-per-parameter curve again. Legacy aliases are deprecated on July 24. https://api-docs.deepseek.com/news/news260424 DeepSeek V4 preview-2026 · Q2 2026 · Q2 Project Prometheus closes $10B Jeff Bezos's physical-AI lab closes a $10B round at a $38B valuation, led by BlackRock and JPMorgan. Prometheus builds AI systems that learn through real-world interaction and physics rather than text or images — a deliberate contrast to the web-trained foundation model stack. Bezos is reportedly structuring a broader industrial-AI holding company targeting up to $100B. Physical AI is the next frontier-model arms race, and Bezos is running the first lap. https://www.bloomberg.com/news/articles/2026-04-23/bezos-s-physical-ai-lab-has-closed-round-at-38-billion-value Project Prometheus closes $10B-2026 · Q2 2026 · Q2 Cognition AI in talks at $25B Cognition — maker of Devin, the first broadly-deployed AI software engineer — enters funding talks at a $25B valuation, more than double its September 2025 mark. Devin's ARR grew from $1M in September 2024 to $73M by June 2025, a 73× increase in nine months. The round is not closed and terms may shift, but the trajectory is the point: agentic coding products are compounding at a pace that makes 2024 benchmarks look quaint. https://www.bloomberg.com/news/articles/2026-04-23/ai-coding-firm-cognition-in-funding-talks-at-25-billion-value Cognition AI in talks at $25B-2026 · Q2 2026 · Q2 Project Glasswing Anthropic launches a cybersecurity coalition with AWS, Apple, Google, Microsoft, and others — backed by Claude Mythos Preview, a new frontier model optimized for finding software vulnerabilities. The initiative commits $100M in model credits and $4M to open-source security. The framing: ensure advanced AI cyber capabilities reach defenders before attackers.
The notable subtext: Claude Mythos is the first public signal of a post-Opus model in the pipeline. https://www.anthropic.com/glasswing Project Glasswing-2026 · Q2 2026 · Q2 Gemma 4 Google DeepMind ships Gemma 4 in four sizes — E2B, E4B, 26B MoE, and 31B Dense — distilled from Gemini 3 and released under Apache 2.0 for the first time in the family's history. The 31B scores 89.2% on AIME 2026 (+68 points over Gemma 3), 80% on LiveCodeBench, and 86.4% on agentic benchmarks. It's the #3 open model on the Arena leaderboard — a 31B model outperforming 400B-class competitors. The open-weight frontier just moved again. https://deepmind.google/models/gemma/gemma-4/ Gemma 4-2026 · Q2 2026 · Q2 OpenAI raises $122B at $852B valuation OpenAI closes the largest private funding round in history: $122B with Amazon ($50B), Nvidia ($30B), and SoftBank ($30B) as lead investors. Valuation hits $852B. For context: that's larger than most sovereign wealth funds and nearly every public tech company outside the Mag-7. The capital is explicitly for compute and infrastructure, not product. The race for GPU clusters is now denominated in hundreds of billions. https://openai.com/index/accelerating-the-next-phase-ai/ OpenAI raises $122B at $852B valuation-2026 · Q2 2026 · Q2 GPT-5.4 OpenAI deprecates GPT-5.1 and ships GPT-5.4, GPT-5.4 Thinking, and GPT-5.4 mini. The model line restructure collapses the older GPT-5.x variants into three tiers: instant, thinking, and pro — mirroring how Anthropic structured the Claude 4 family. ChatGPT also gains CarPlay integration, a File Library, and interactive math and science modules for 70+ topics. The platform is quietly becoming an OS-level interface. 
https://openai.com/news/ GPT-5.4-2026 · Q2 2026 · Q2 Meta Muse Spark Meta's first major model release since acquiring Scale AI's Alexandr Wang scores #4 on the Artificial Analysis Intelligence Index with strong multimodal, reasoning, health, and agentic results — at a fraction of the compute cost of Llama 4 Maverick. Meta guided $115–135B in AI capex for 2026. The signal: the open-model era is over; Meta is building proprietary closed frontier models now. https://www.cnbc.com/2026/04/08/meta-debuts-first-major-ai-model-since-14-billion-deal-to-bring-in-alexandr-wang.html Meta Muse Spark-2026 · Q2 2026 · Q2 Grok 4.20 + xAI Series E xAI ships Grok 4.20 with the strongest current-events accuracy of any frontier model at release — a direct product of real-time X social data integration that no other lab has access to at equivalent scale. The model closes the factuality gap that plagued Grok 3, with Grok 4.20 leading on news accuracy within a 30-day window. The 4-agent architecture ships here: Grok as coordinator, Harper as research, Benjamin for logic and mathematics, and Lucas as contrarian analysis — all running in parallel and cross-verifying outputs before a response surfaces. Intelligence Index score: 49, placing it competitively with but not clearly ahead of GPT-5.4 or Claude Opus 4.7 on general benchmarks. Simultaneously, xAI closes a $20B Series E with Nvidia, Cisco, QIA, and others. The pairing is intentional: the money goes straight to compute, the model is the proof of concept that it's being spent well. https://releasebot.io/updates/xai Grok 4.20 + xAI Series E-2026 · Q2 2026 · Q2 The Anthropic Institute Anthropic spins out a dedicated research organization led by co-founder Jack Clark — who spent five years at OpenAI as Policy Director before leaving to co-found Anthropic, and who runs Import AI, a newsletter with 70,000 weekly subscribers that has tracked every major model release since 2017. 
The Institute consolidates three teams: Frontier Red Team (stress-testing models at the outer edge of their capabilities), Societal Impacts (tracking real-world deployment effects), and Economic Research (measuring labor-market shifts as AI scales). The mandate is deliberately empirical: produce primary data about how AI is affecting workers and economies, rather than relying on modeling or extrapolation, and publish findings even when they're uncomfortable. Hires include Matt Botvinick from Google DeepMind on AI and rule of law, Anton Korinek from UVA on economic transformation, and Zoë Hitzig, formerly of OpenAI, bridging economics and model development. The structural move matters: by separating safety and societal research from the product org, Anthropic is betting that credibility requires independence — a research institute that reports to the same team shipping Claude has a conflict of interest on every finding. https://www.anthropic.com/news/the-anthropic-institute The Anthropic Institute-2026 · Q1 2026 · Q1 Karpathy's autoresearch Karpathy releases a 630-line open-source script that lets an AI agent autonomously run ML experiments on a fixed compute budget — hypothesize, modify code, run, collect results, repeat overnight. After two days on a GPU, the agent found ~700 improvements, with ~20 transferring to larger models and delivering an 11% efficiency gain (Time-to-GPT-2 from 2.02h → 1.80h). The repo hit 85K+ stars. The cleanest possible signal that AI agents are now the primary tool for research. https://github.com/karpathy/autoresearch Karpathy's autoresearch-2026 · Q1 2026 · Q1 Tobi adapts autoresearch Within days of Karpathy's release, Tobi Lütke adapts autoresearch for a Shopify model training run — a 0.8B parameter model with a Raspberry Pi-based compute loop — and reports a 19% validation improvement. The public adaptation loop between practitioner and researcher signaled a new kind of open science: fork, run overnight, post results by morning. 
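The hypothesize, modify, run, collect, repeat cycle described in the two autoresearch entries can be sketched as a simple keep-if-better loop. This is a toy stand-in, not Karpathy's script: `run_experiment`, `propose`, and `research_loop` are hypothetical names, the "training run" is a made-up quadratic objective, and the "code edit" is a random perturbation of one setting.

```python
import random

def run_experiment(config, rng):
    """Stand-in for a training run: returns a validation loss (lower is better)."""
    # Hypothetical objective: loss is lowest near lr = 0.3, plus a little noise.
    return (config["lr"] - 0.3) ** 2 + rng.uniform(0.0, 0.01)

def propose(config, rng):
    """Stand-in for the agent's code edit: perturb one setting."""
    candidate = dict(config)
    candidate["lr"] = max(0.0, candidate["lr"] + rng.uniform(-0.1, 0.1))
    return candidate

def research_loop(budget, seed=0):
    rng = random.Random(seed)
    best = {"lr": 1.0}
    best_loss = run_experiment(best, rng)
    history = [best_loss]
    for _ in range(budget):              # fixed compute budget: one run per step
        candidate = propose(best, rng)
        loss = run_experiment(candidate, rng)
        if loss < best_loss:             # keep only changes that improve the metric
            best, best_loss = candidate, loss
        history.append(best_loss)
    return best, best_loss, history

best, best_loss, history = research_loop(budget=50)
print(f"kept config {best} after {len(history) - 1} runs")
```

The design choice that matters is the acceptance rule: because only improving changes are kept, the best-so-far metric never gets worse, and the budget caps how long the loop can run unattended.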
https://x.com/tobi Tobi adapts autoresearch-2026 · Q1 2026 · Q1 Claude Opus 4.6 Anthropic's flagship at release, later succeeded by Opus 4.7. The headline capability is multi-agent coordination: Opus 4.6 acts as an orchestrator that spins up and manages a team of specialized subagents, each running tool calls and accumulating context independently, then consolidates their outputs — distinct from earlier single-agent loops that just re-called the same model. The 128K max output token limit makes full codebase rewrites and long-form document generation tractable in a single pass; prior models capped at 8K. Extended thinking with tool use lets Claude alternate between chain-of-thought reasoning and live tool calls mid-problem, rather than reasoning first and then executing separately. On OSWorld — a benchmark that measures autonomous computer use across real GUI environments — it hits 72.7%, the state of the art at release; for context, Claude 3.5 Sonnet scored around 22% when computer use first shipped in late 2024, so the jump in eighteen months is roughly 50 percentage points. SWE-bench Verified: 80.8%, just above Sonnet 4.6 at 79.6%. Priced at $5/$25 per million tokens on a 200K context window. https://www.anthropic.com/news/claude-opus-4-5 Claude Opus 4.6-2026 · Q1 2026 · Q1 Claude Sonnet 4.6 The first Sonnet-class model to hit 1M token context, which matters because previous Sonnet models topped out at 200K — the jump makes full-repo analysis and large document corpora practical at mid-tier pricing. Ships February 17, 2026. SWE-bench Verified: 79.6%, within 1.2 points of Opus 4.6 (80.8%) at 60% of the per-token price — at release, GPT-4.1 sat around 54% on the same benchmark, so Sonnet 4.6 lands well ahead of the prior OpenAI mid-tier. OSWorld: 72.5%, essentially tied with Opus 4.6 at 72.7%. Terminal-Bench 2.0: 59.1%.
Developer telemetry shows users choosing Sonnet 4.6 over Opus 4.5 59% of the time — the preference flip confirms that the performance-per-dollar gap had closed enough that latency and price became the deciding factors. What changed from 4.5: adaptive thinking (the model dynamically scales reasoning depth per task rather than requiring a mode switch), native 1M context, and context compaction for sessions that would otherwise overflow. Priced at $3/$15 per million tokens. https://www.anthropic.com/news/claude-sonnet-4-6 Claude Sonnet 4.6-2026 · Q1 2026 · Q1 GPT-4.5 Released February 27, 2025, GPT-4.5 is OpenAI's largest non-reasoning model — the explicit bet that raw scale and better training data produce a qualitatively different conversational experience, not just higher benchmark numbers. SimpleQA accuracy: 62.5%, against 47.0% for GPT-4o — it hallucinated 37.1% of the time versus 59.8% for 4o, which is the headline improvement in practice. MMLU: 85.1%, above o3-mini's 81.1% on undergraduate-level knowledge. On math and hard science, o3-mini still wins by significant margins — the tradeoff is explicit. Where 4.5 separates itself is emotional register: it detects sentiment, adjusts tone, and tracks implicit intent across long conversations in ways that prior models treated as secondary concerns. Context window: 128K tokens. Pricing at launch was $75/$150 per million tokens, 30× GPT-4o's input price and 15× its output price (GPT-4o was $2.50/$10 at the time), which made it effectively a research and power-user product rather than a default API choice. The positioning is 'know what you mean,' not 'think harder'; it was the last model OpenAI would release before the o-series and GPT-5 line fully absorbed the product roadmap. https://openai.com/index/introducing-gpt-4-5 GPT-4.5-2026 · Q1 2026 · Q1 Symphony Open-sourced March 4, 2026 under Apache 2.0 at github.com/openai/symphony, Symphony is a Codex App Server orchestration spec that turns a Linear board into a persistent agent control plane.
The mechanism: every open issue gets a dedicated Codex agent in an isolated workspace; Symphony polls the board continuously, restarts stalled agents, picks up new issues as they appear, and runs the full ticket lifecycle — triage, code, tests, PR — without a human in the loop until review. The reference implementation was written in Elixir by Codex itself in a single pass; OpenAI then had Codex reimplement it in TypeScript, Go, Rust, Java, and Python to stress-test the spec for ambiguities. Internal teams using Symphony reported a 500% increase in landed PRs in the first three weeks — the metric that matters, since merged code is the unit of delivery. The model running each agent is Codex (GPT-5.x family). What makes it production-grade is the failure recovery: agents that crash or stall are automatically restarted, so the board drains rather than stalls. https://github.com/openai/symphony Symphony-2026 · Q1 2026 · Q1 everything is a ralph loop Huntley's thesis: stop building brick by brick, start programming the loop. A single monolithic, autonomous process that runs continuously — the agent *is* the system. The post reframes agentic development from task execution to loop design. Where your loop is slow or stuck, that's where your growth is. https://ghuntley.com/loop/ everything is a ralph loop-2026 · Q1 2026 · Q1 OpenAI Realtime API GA + gpt-realtime OpenAI takes the Realtime API from beta to general availability on August 28, paired with gpt-realtime — its most capable speech-to-speech model and a direct replacement for the stitched-together STT/LLM/TTS pipelines that had defined voice AI for three years. The capability jump is measurable: 66.5% on ComplexFuncBench audio eval versus 49.7% for the December 2024 model, meaning the model now reliably calls tools during voice sessions rather than fumbling them. 
Three new features ship at GA: image input (the model can see what the user is looking at during a voice call), remote MCP server support (live tool calls to external services mid-conversation), and SIP phone integration (connect directly to the public phone network and PBX systems without a telephony middleware layer). Two new voices — Cedar and Marin — are exclusive to the Realtime API. The SIP support is the underappreciated detail: it means the API can replace an IVR system, not just augment a chatbot. The pattern is the same one that played out in text APIs two years earlier — proprietary bespoke pipeline builds become commoditized surface, and the product race moves one level up. https://openai.com/index/introducing-gpt-realtime/ OpenAI Realtime API GA + gpt-realtime-2025 · Q3 2025 · Q3 Mistral Medium 3.1 Mistral ships Medium 3.1 on August 12 — a multimodal proprietary model with a custom-trained vision encoder, 128K context, and output at 113 tokens per second, priced at $0.40/$2.00 per million tokens. The pitch is explicit: roughly 90% of Claude Sonnet 3.7's performance at a fraction of the cost, with 74.4% on MMLU Pro and strong multimodal reasoning. The model is fully closed — no weights — which marks a quiet but significant strategic line for Mistral, a company that built its brand on open-weight releases. The commercial logic is clear: open models serve as marketing and research vehicles; Medium 3.1 is the product. For developers who need cost-efficient multimodal inference at production scale, it fills the gap between cheap general models and Claude/GPT tier pricing. The European differentiation adds real purchasing leverage — GDPR compliance, EU data residency, no American hyperscaler dependency in the call path. Mistral is quietly becoming a closed-model company while keeping the open-weight brand active, a dual strategy that lets it compete on both axes simultaneously. 
https://mistral.ai/news/mistral-medium-3/ Mistral Medium 3.1-2025 · Q3 2025 · Q3 GPT-5 OpenAI ships GPT-5 on August 7 — not as a single new model but as a unified system: a fast tier for most requests, a thinking tier (GPT-5 thinking) for hard problems, and a real-time router that picks between them without user intervention. Benchmarks at release: 74.9% on SWE-bench Verified (leading all models), 94.6% on AIME 2025, 88.4% on GPQA expert science. Context is 400K tokens with up to 128K output. Pricing spans three tiers (standard at $1.25/$10, mini at $0.25/$2, nano at $0.05/$0.40 per million tokens), the first time OpenAI had occupied every price point in a single release family. The architectural story matters more than any individual benchmark: GPT-5 eliminates the manual model selection problem that had accumulated since o1. Developers no longer decide which reasoning class to route a request to; the system decides. The nano tier at $0.05/$0.40 is a pricing signal as much as a product — OpenAI is deliberately preventing smaller providers from owning the cost-sensitive production tier. GPT-5 is the first OpenAI release designed to retire its entire prior generation in one move rather than layering on top. https://openai.com/index/introducing-gpt-5/ GPT-5-2025 · Q3 2025 · Q3 NVIDIA hits $4T NVIDIA closes at a $4 trillion market capitalization on July 10 — the first publicly traded company to reach that level, clearing Apple and Microsoft, which had plateaued above $3T. The milestone is a direct product of the AI infrastructure buildout: every major lab and hyperscaler was deploying GPU clusters on a schedule measured in gigawatts, NVIDIA held the only production-grade training and inference hardware at the frontier, and the H100 had become the reserve currency of the AI economy. At $4T, NVIDIA was larger than the entire UK stock market and roughly equivalent to the GDP of Germany.
The architecture moat — CUDA, NVLink, the NIM inference stack — had been built over two decades and had no realistic 18-month challenger. The market was not pricing near-term chip cycle revenue; it was pricing perceived multi-decade infrastructure dominance. Whether that view holds depends on how quickly AMD, custom silicon from Google and Amazon, and open alternatives like ROCm can close the software stack gap. In 2025, they hadn't. The $4T close was the clearest single data point that the AI compute race had become a permanent infrastructure category, not a temporary capex surge. https://www.washingtonpost.com/technology/2025/07/10/nvidia-4-trillion-market-cap/ NVIDIA hits $4T-2025 · Q3 2025 · Q3 Grok 4 xAI ships Grok 4 and Grok 4 Heavy on July 9, unveiled via a livestream that drew 1.5 million concurrent viewers — the largest live audience for a model launch to that point. The benchmark that got the most attention was Humanity's Last Exam: Grok 4 Heavy scored above 50% on the text-only subset, the first model to clear that threshold on a benchmark explicitly designed to resist saturation. USAMO 2025 math proofs at 61.9%, ARC-AGI V2 at 15.9%. Trained on xAI's 200,000-GPU Colossus cluster using 6× more compute than Grok 3, with a 256K token context window. Native tool use is the architectural distinction: Grok 4 was trained to autonomously select its own web search queries and run a code interpreter mid-reasoning, rather than receiving tool calls as post-training injections. Grok 4 Heavy runs multiple reasoning agents in parallel at inference time, matching the test-time compute scaling pattern OpenAI had explored with o1-pro. The live audience and HLE score together mark xAI's transition from frontier competitor to challenger on the hardest tasks in the benchmark suite. The X data integration advantage — real-time access to the highest-velocity public information source without an API intermediary — remains the product moat no other lab can replicate.
https://x.ai/news Grok 4-2025 · Q3 2025 · Q3 Gemini 2.5 Flash GA Google moves Gemini 2.5 Flash to general availability on June 17, completing the transition from experimental and preview releases. The model is the workhorse of the 2.5 family: 1M token context, built-in thinking mode, strong on reasoning and coding, priced at $0.30/$2.50 per million tokens — roughly a quarter of 2.5 Pro pricing for use cases where the additional capability headroom isn't required. Flash-Lite reached stable production status separately on July 22, at 1.5× the speed of Gemini 2.0 Flash at lower cost. The GA announcement reported a 25% improvement across internal benchmarks versus preview versions. The practical impact for developers was immediate: stable model IDs, committed deprecation timelines, and SLA-backed availability — the infrastructure properties that make a model usable in production rather than just interesting in a notebook. Google's two-track Flash/Pro strategy mirrors the competitive structure Anthropic established with Haiku/Sonnet/Opus, and means Google is now competing directly for the high-throughput API tier where most production tokens are actually spent, not just the prestige frontier tier where benchmarks are made. https://cloud.google.com/blog/products/ai-machine-learning/gemini-2-5-flash-lite-flash-pro-ga-vertex-ai Gemini 2.5 Flash GA-2025 · Q3 2025 · Q3
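Because several entries above quote per-million-token prices, a small calculation helps make the gaps concrete. The sketch below uses only the input/output prices stated in those entries. The dictionary keys are informal labels chosen here, not official API model IDs, and the example workload size is an arbitrary assumption.

```python
# (input, output) prices in USD per million tokens, as quoted in the entries above.
# Keys are informal labels for illustration, not official model identifiers.
PRICES = {
    "claude-sonnet-4.6": (3.00, 15.00),
    "gpt-4.5": (75.00, 150.00),
    "mistral-medium-3.1": (0.40, 2.00),
    "gpt-5": (1.25, 10.00),
    "gemini-2.5-flash": (0.30, 2.50),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted list prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def rank_by_cost(input_tokens: int, output_tokens: int) -> list[str]:
    """Return model labels sorted from cheapest to most expensive for a workload."""
    return sorted(PRICES, key=lambda m: request_cost(m, input_tokens, output_tokens))


if __name__ == "__main__":
    # Hypothetical workload: 100K input tokens and 10K output tokens per request.
    for label in rank_by_cost(100_000, 10_000):
        print(f"{label:20s} ${request_cost(label, 100_000, 10_000):.3f}")
```

The ranking depends on the input/output mix: a model with a cheap input price but a higher output price can lose its lead on generation-heavy workloads, so it is worth running the numbers for the actual token split rather than comparing headline prices.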