releases, ideas, and moments that changed how everyone’s work gets done.
Anthropic launches a cybersecurity coalition with AWS, Apple, Google, Microsoft, and others — backed by Claude Mythos Preview, a new frontier model optimized for finding software vulnerabilities. The initiative commits $100M in model credits and $4M to open-source security. The framing: ensure advanced AI cyber capabilities reach defenders before attackers. The notable subtext: Claude Mythos is the first public signal of a post-Opus model in the pipeline.
anthropic · apr

Google DeepMind ships Gemma 4 in four sizes — E2B, E4B, 26B MoE, and 31B Dense — distilled from Gemini 3 and released under Apache 2.0 for the first time in the family's history. The 31B scores 89.2% on AIME 2026 (+68 points over Gemma 3), 80% on LiveCodeBench, and 86.4% on agentic benchmarks. It's the #3 open model on the Arena leaderboard — a 31B model outperforming 400B-class competitors. The open-weight frontier just moved again.
google · apr

OpenAI closes the largest private funding round in history: $122B with Amazon ($50B), Nvidia ($30B), and SoftBank ($30B) as lead investors. Valuation hits $852B. For context: that's larger than most sovereign wealth funds and nearly every public tech company outside the Mag-7. The capital is explicitly for compute and infrastructure, not product. The race for GPU clusters is now denominated in hundreds of billions.
openai · mar

OpenAI deprecates GPT-5.1 and ships GPT-5.4, GPT-5.4 Thinking, and GPT-5.4 mini. The model line restructure collapses the older GPT-5.x variants into three tiers: instant, thinking, and pro — mirroring how Anthropic structured the Claude 4 family. ChatGPT also gains CarPlay integration, a File Library, and interactive math and science modules for 70+ topics. The platform is quietly becoming an OS-level interface.
openai · mar

Meta's first major model release since acquiring Scale AI's Alexandr Wang scores #4 on the Artificial Analysis Intelligence Index with strong multimodal, reasoning, health, and agentic results — at a fraction of the compute cost of Llama 4 Maverick. Meta guided $115–135B in AI capex for 2026. The signal: the open-model era is over; Meta is building proprietary closed frontier models now.
meta · apr

xAI ships Grok 4.20 with the highest current-events accuracy of any frontier model — a direct product of deep X data integration. The model closes the factuality gap that plagued Grok 3. Simultaneously, xAI closes a $20B Series E with Nvidia, Cisco, QIA, and others. The pairing is intentional: the money goes straight to compute, the model is the proof of concept that it's being spent well.
xai · mar

Anthropic spins out a dedicated research organization led by co-founder Jack Clark, combining three teams: Frontier Red Team, Societal Impacts, and Economic Research. The mission is to surface ground-truth information about advanced AI during the transition period — working with affected workers and communities, publishing candid findings. It's the first time Anthropic has structurally separated its safety and societal research from its product org.
anthropic · mar

Karpathy releases a 630-line open-source script that lets an AI agent autonomously run ML experiments on a fixed compute budget — hypothesize, modify code, run, collect results, repeat overnight. After two days on a GPU, the agent found ~700 improvements, with ~20 transferring to larger models and delivering an 11% efficiency gain (Time-to-GPT-2 from 2.02h → 1.80h). The repo hit 21K+ stars in days. The cleanest possible signal that AI agents are now the primary tool for research.
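The loop described above (hypothesize, modify, run, keep what works, all under a fixed budget) can be sketched in a few lines. This is a toy hill-climbing version: `autoresearch`, `evaluate`, and `mutate` are illustrative names, not the actual script's API, and a cheap analytic objective stands in for a real training run.

```python
import random

def autoresearch(evaluate, mutate, baseline_cfg, budget):
    """Toy budgeted experiment loop: propose a change, run it,
    keep it only if the metric improves, repeat until the compute
    budget is spent. Names are illustrative, not the real script's."""
    best_cfg, best_score = baseline_cfg, evaluate(baseline_cfg)
    kept = []
    for _ in range(budget):           # fixed budget: one run per step
        candidate = mutate(best_cfg)  # "hypothesize + modify code"
        score = evaluate(candidate)   # "run + collect results"
        if score > best_score:        # only improvements survive
            best_cfg, best_score = candidate, score
            kept.append(candidate)
    return best_cfg, best_score, kept

# Tiny demo: maximize -(x - 3)^2 by nudging x within the budget.
rng = random.Random(0)
cfg, score, kept = autoresearch(
    evaluate=lambda c: -(c["x"] - 3.0) ** 2,
    mutate=lambda c: {"x": c["x"] + rng.uniform(-0.5, 0.5)},
    baseline_cfg={"x": 0.0},
    budget=200,
)
```

The real script's value was not the search strategy but the unattended wall-clock scale: the same accept-if-better loop, left running overnight against actual training runs.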
karpathy · mar

Within days of Karpathy's release, Tobi Lütke adapts autoresearch for a Shopify model training run — a 0.8B parameter model with a Raspberry Pi-based compute loop — and reports a 19% validation improvement. The public adaptation loop between practitioner and researcher signaled a new kind of open science: fork, run overnight, post results by morning.
tobi · mar

Anthropic's current flagship. Introduces agent teams — multiple coordinating subagents that split tasks and reconvene — alongside 128K max output tokens, extended thinking with tool use, and native PowerPoint integration. Hits 72.7% on OSWorld (computer use benchmark). Priced at $5/$25 per million tokens.
anthropic · feb

The daily driver. 79.6% on SWE-bench Verified, 72.7% on OSWorld. Developer telemetry shows users prefer Sonnet 4.6 over Opus 4.5 59% of the time, confirming the shift: the frontier is efficiency and latency, not just raw capability. $3/$15 per million tokens.
anthropic · feb

OpenAI's next capability step. Trained to be meaningfully stronger at nuance, intent inference, and long-form instruction following than the o-series reasoning models. Less likely to over-engineer a response or go off-script. Positioned as the 'know what you mean' model rather than the 'think harder' one.
openai · feb

OpenAI's long-running agent service. Reads your Linear board, spins an isolated workspace per issue, routes a Codex agent through the full ticket lifecycle — triage, code, test, PR — without a human in the loop. First production-grade system to treat a project management tool as the agent's input queue.
openai · mar

Huntley's thesis: stop building brick by brick, start programming the loop. A single monolithic, autonomous process that runs continuously — the agent *is* the system. The post reframes agentic development from task execution to loop design. Where your loop is slow or stuck, that's where your growth is.
ghuntley · jan

Anthropic's most capable agentic model at time of release. 61.4% on OSWorld — up from 42.2% on Sonnet 4 just four months prior. The jump in computer use performance without a price increase redefined what the mid-tier meant. Strong reasoning, faster latency, the same $3/$15 pricing.
anthropic · sep

Google's Gemini 2.5 Pro moves from experimental to general availability, bringing its top-ranked long-context performance to production workloads. Consistently leads on coding benchmarks and remains the go-to for document-heavy and codebase-scale tasks requiring million-token context. Free tier via AI Studio.
google · jun

Cursor reaches $100M ARR — the fastest SaaS product to that milestone in history. Claude Code, Cline, Aider, Amp, and Codex CLI now each have distinct user bases and distinct use cases. The IDE has fractured: no single tool owns the workflow. Developers are running 3-4 agents per session, each assigned by task type.
industry · q3

Anthropic's Claude 4 generation ships: Opus 4 scores 72.5% on SWE-bench and 43.2% on Terminal-bench, capable of running autonomous sessions up to seven hours. Both models introduce extended thinking with tool use, letting Claude alternate between chain-of-thought reasoning and tool calls mid-problem. Classified internally as 'Level 3' on Anthropic's safety scale.
anthropic · may

Anthropic's agentic CLI goes public after a limited beta. Full codebase context, hook system for custom behaviors, tool use, and a session model that survives context compaction. The first coding agent designed to be left running unsupervised — not just autocomplete, but a thing that makes decisions. The start of a real agentic workflow.
anthropic · apr

OpenAI ships a terminal-native coding agent built on the Codex models. Fully sandboxed execution, lightweight footprint compared to Claude Code, strong on sysadmin-level tasks, scripting, and operational work. First OpenAI product designed for the terminal as primary interface.
openai · may

Lütke's internal memo goes wide: AI usage is now a baseline expectation at Shopify, not a differentiator. Performance reviews will account for it. The most direct statement from a tech CEO that AI fluency is a job requirement, not a bonus skill — and it came from a company with 10,000+ employees, not a startup.
tobi · apr

Google releases an experimental version of Gemini 2.5 Pro, hitting the top of most coding and reasoning leaderboards on release. Strongest long-context model available at the time — the first serious challenger to Claude and GPT-4o across the full benchmark suite, not just narrow tasks.
google · mar

DeepSeek releases the first open-source reasoning model trained via pure reinforcement learning to match OpenAI o1. 79.8% on AIME, 97.3% on MATH-500. Permissive license, full 671B weights published. Shattered the assumption that reasoning capability required closed training data or RLHF at proprietary scale. The week it shipped, every frontier lab's stock dropped.
deepseek · jan

Karpathy coins the term in a tweet: describe a project in natural language, accept AI-generated code without reading it, iterate on behavior not syntax. Within weeks, Merriam-Webster adds it to the dictionary. Collins names it Word of the Year 2025. A name for a practice millions of people were already doing — which made them realize they were allowed to do it.
karpathy · feb

Anthropic's first hybrid reasoning model: standard fast responses for most tasks, extended thinking mode for hard problems. The first Anthropic model to show o1-style chain-of-thought, interleaved with tool use. On SWE-bench Verified: 62.3%. Marked the transition from Claude as a chat model to Claude as a reasoning system.
anthropic · feb

xAI ships Grok 3, trained on 10x the compute of Grok 2. Tops major reasoning benchmarks at launch. Competitive on math and graduate-level science. Access to real-time X data gives it a live information advantage no other frontier model has. The first xAI model that felt like genuine frontier competition.
xai · feb

OpenAI open-sources a multi-agent orchestration framework with first-class primitives for handoffs, guardrails, and tool use across agent graphs. The Python SDK is lean enough to actually use. Arrives alongside Responses API, which replaces the older Completions and Assistants endpoints. The clearest signal that OpenAI sees agent orchestration — not raw models — as the product layer.
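The handoff primitive is the load-bearing idea: one agent inspects a request and transfers control to a specialist. A minimal dependency-free sketch of that pattern follows; the names (`make_agent`, `run`) are invented here for illustration and are not the SDK's actual API.

```python
# Toy sketch of the "handoff" pattern: a triage agent routes each
# request to a specialist agent. A handler either answers outright
# or names another agent to hand off to.
def make_agent(name, handler, handoffs=None):
    return {"name": name, "handler": handler, "handoffs": handoffs or {}}

def run(agent, request):
    action, payload = agent["handler"](request)
    if action == "handoff":
        # Transfer control: the specialist sees the same request.
        return run(agent["handoffs"][payload], request)
    return agent["name"], payload

billing = make_agent("billing", lambda r: ("answer", "refund issued"))
support = make_agent("support", lambda r: ("answer", "ticket opened"))
triage = make_agent(
    "triage",
    lambda r: ("handoff", "billing" if "refund" in r else "support"),
    handoffs={"billing": billing, "support": support},
)

who, reply = run(triage, "I want a refund")  # routed to billing
```

In the real SDK the handlers are model calls and guardrails wrap the edges, but the control-flow shape is the same: agents as nodes, handoffs as edges.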
openai · mar

Anthropic ships Claude 3.5 computer use in public beta: Claude can see your screen, move the cursor, click, type, and navigate applications like a human operator. First frontier model to offer desktop automation as a first-class capability. Simultaneously, Claude 3.5 Haiku ships — fast, cheap, and surprising: 40.6% on SWE-bench, beating models twice its size.
anthropic · oct

Apple's redesigned Mac Mini ships with M4 and M4 Pro chips at $600 and $1,400. The critical moment: a $600 machine can now run 70B parameter models at usable inference speeds via Ollama or LM Studio. Local AI stops being a hobbyist exercise and becomes a practical daily option. The on-device inference era begins in earnest.
apple · nov

Anthropic open-sources MCP — a universal connector standard for AI agents and external systems. Rapidly adopted by OpenAI, Google, and the major tooling providers. SDKs in Python, TypeScript, C#, Java. Before MCP, every tool integration was custom. After MCP, it's a standard protocol with a growing ecosystem of pre-built connectors.
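Under the hood, MCP messages are JSON-RPC 2.0: a client discovers a server's tools via `tools/list` and invokes one via `tools/call`. A minimal sketch of what a single invocation looks like on the wire; the tool name and arguments here are made-up examples, not part of the spec.

```python
import json

def tool_call_request(request_id, tool_name, arguments):
    """Build the JSON-RPC 2.0 request an MCP client sends to invoke
    a tool the server previously advertised via "tools/list"."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

msg = tool_call_request(1, "search_issues", {"query": "flaky test"})
wire = json.dumps(msg)  # what actually crosses the stdio or HTTP transport
```

The point of the standard is exactly this uniformity: every connector, whatever it wraps, speaks the same request and response shapes, so one client implementation works against any server.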
anthropic · nov

Alibaba releases Qwen 2.5-Coder in 7B (Sep), 32B, and 72B (Nov) variants. The 32B beats GPT-4o and Claude 3.5 Sonnet on HumanEval. Open weights under Apache 2.0, runnable on a Mac Studio. The strongest evidence yet that specialized open models can outperform general-purpose proprietary ones on targeted tasks.
qwen · nov

Google announces Gemini 2.0 Flash: native image and audio output, a Multimodal Live API for real-time interaction, tool use baked in. Beats Gemini 1.5 Pro at twice the speed. Marketed explicitly as built for 'the agentic era.' Free in AI Studio. The first Gemini model that felt like Google had caught up.
google · dec

A 671B MoE model trained for $5.58M that matches Claude 3.5 Sonnet and o1 on most benchmarks and runs 3x faster than its predecessor. No CUDA dependency. A full technical paper published alongside it. The cost number — $5.58M — circled the internet for weeks and forced every lab to publicly rethink their training cost assumptions.
deepseek · dec

Meta ships Llama 3.1 with a 405B parameter flagship, 128K context, and support for eight additional languages. The 405B model is the first open-weight model that closes the gap with frontier proprietary models on coding tasks. Released under a license that allows commercial use for most companies. The moment open-source stopped being 'almost as good' and became genuinely competitive.
meta · jul

OpenAI ships o1-preview and o1-mini: models that 'think before they answer' via long internal chain-of-thought reasoning. o1-preview hits PhD-level performance on science benchmarks. The trade: slower, more expensive, less fluent prose. But for hard math, code, and logic problems, it was visibly different. A new capability tier, not just an incremental update.
openai · sep

Alibaba releases Qwen 2.5 across dense and MoE variants, describing it as 'perhaps the largest open-source release in history.' Ships alongside Qwen2.5-Coder and Qwen2.5-Math, each trained on domain-specific corpora. The family demonstrated that systematic specialization — separate model families for distinct problem types — could yield gains that generalist training couldn't match.
qwen · sep

Google ships Audio Overviews in NotebookLM — converts any uploaded document into a two-host podcast. The demo circulated instantly: AI hosts debating and explaining research papers with natural banter, interruptions, and laughter. Demonstrated that synthetic audio had crossed a quality threshold where most listeners couldn't distinguish it from human production. The podcast format became a new interface for consuming dense material.
google · sep

Meta releases Llama 3.2 in four sizes: 1B and 3B text-only models optimized for edge and mobile, plus 11B and 90B vision models. The 1B and 3B variants are the first serious on-device open models — 128K context, fast, and small enough to ship in an app. The 90B vision model surpasses Claude 3 Haiku on image understanding.
meta · sep

OpenAI ships GPT-4 Omni: natively multimodal across text, audio, image, and video — no separate encoders, one unified model. Audio response latency as low as 232ms approached human conversational speed. The live demo showed it singing, responding to facial expressions, flirting. The 'Her' moment. Made free to all ChatGPT users on day one, which is how it got 100M people testing omni-modal AI in a week.
openai · may

Anthropic releases Claude 3.5 Sonnet at 80% lower cost than Claude 3 Opus — and it outperforms Opus on most benchmarks. Ships with Artifacts: a separate persistent window for code Claude generates, making iterative development feel more like pair programming than conversation. The shift from Claude-as-a-chat-model to Claude-as-a-dev-tool starts here.
anthropic · jun

Meta releases Llama 3 in 8B and 70B sizes, trained on 15 trillion tokens — 7x more data than Llama 2. The 70B model rivals GPT-3.5 on most benchmarks and exceeds Llama 2 70B on everything. Open weights, commercial-friendly license, and immediate integration into Ollama, LM Studio, and the broader local inference ecosystem.
meta · apr

Apple announces Apple Intelligence at WWDC 2024: on-device models across iPhone, iPad, and Mac — writing tools, image generation, an overhauled Siri with personal context awareness, and Private Cloud Compute for privacy-preserving server inference. The first time a consumer hardware platform embedded frontier AI at OS level. ChatGPT integration built in. Requires Apple Silicon.
apple · jun

Google's Gemini 1.5 Pro becomes generally available with a 1-million-token context window. MoE architecture that handles extreme context while maintaining reasoning performance. Changed what 'working with a codebase' meant — you could drop an entire repo in, ask questions, and get coherent answers across all of it.
google · may

Anthropic launches Claude 3 in three sizes, Haiku, Sonnet, and Opus — the first model family to give developers meaningful speed/cost/quality tradeoffs across a shared API. Opus outperforms GPT-4 on multiple benchmarks. The tiered family model became the industry standard: every major lab ships a fast, a mid, and a flagship within the year.
anthropic · mar

Google announces Gemini 1.5 Pro with a breakthrough 1-million-token context window using a sparse MoE architecture. Long before general availability, the context capability alone changes how practitioners think about what a model can hold. The announcement triggers a context war across every lab.
google · feb

OpenAI announces Sora: a text-to-video diffusion model that generates 60-second photorealistic clips with consistent physics, camera movement, and object permanence. Released as a research preview without public API access. The demos are immediately the most technically impressive video generation anyone has seen — and the most concerning.
openai · feb

xAI open-sources Grok-1 — the full 314B MoE base model — under Apache 2.0. Released months after its initial deployment to X Premium+ users. At the time, the largest open-source model available. Validated that a frontier-scale model could be openly released and that Apache 2.0 was a viable license for it.
xai · mar

OpenAI hosts its first developer conference: GPT-4 Turbo with 128K context, GPTs (custom instruction-tuned assistants with a storefront), and the Assistants API for stateful agent-like apps. The moment the OpenAI platform strategy became clear — not just a model, a developer ecosystem. The GPT Store that followed brought both promise and chaos.
openai · nov

Mistral AI drops Mixtral 8x7B without announcement — a torrent link on X with no blog post. The 46.7B total parameter MoE model was faster and better than Llama 2 70B while using a fraction of the compute at inference time. The 'drop it and run' release style became a Mistral signature and briefly the coolest thing in open-source AI.
mistral · dec

Google announces Gemini 1.0 — its multimodal response to GPT-4. Nano, Pro, and Ultra tiers. The Ultra variant claims to beat GPT-4 on MMLU, the first non-OpenAI model to do so. Google's credibility was on the line after a slow year. Gemini marked the beginning of the actual race rather than one-sided dominance.
google · dec

Elon Musk's xAI launches Grok to X Premium+ subscribers — a 314B MoE model trained from scratch with real-time X data access. The model itself was not remarkable. But the distribution play was: embedding a frontier AI assistant directly into a 500M-user social network and using live information access as the differentiator.
xai · nov

Meta and Microsoft release Llama 2 with 7B, 13B, and 70B variants, trained on 2 trillion tokens — double Llama 1. Despite licensing restrictions that sparked debate about what 'open source' means in AI, Llama 2 democratized access to frontier-scale weights. Within weeks, fine-tuned variants flooded HuggingFace. The open-weights movement began in earnest.
meta · jul

Meta releases Code Llama (7B, 13B, 34B) — Llama 2 fine-tuned on 500B tokens of code with fill-in-the-middle support and a 100K token context window for code-heavy prompts. The starting gun for specialized coding models. Developers immediately had a capable, free, locally-runnable coding assistant, with no API key required.
meta · aug

Mistral AI, founded in April by ex-Meta and Google researchers, releases its first model — Mistral 7B — and it outperforms Llama 2 13B despite being half the size. Grouped-query attention and sliding window attention made it fast and memory-efficient. The announcement came with no blog post, just a weights download. The model sparked a wave of European AI investment and signaled that founding team pedigree could compress timelines dramatically.
mistral · sep

The Technology Innovation Institute in Abu Dhabi releases Falcon 180B — a 180B parameter model trained on 3.5 trillion tokens using 4,096 A100 GPUs. It briefly topped the Hugging Face leaderboard, beating GPT-3.5 on multiple benchmarks. Proof that frontier-capable open models could be built outside the US/European research establishment.
tii · sep

Auto-GPT — an open-source experiment that chains GPT-4 calls into a goal-directed autonomous agent — becomes the fastest GitHub repo to 100K stars. It barely worked. Tasks looped, context ran out, outputs were unreliable. But it made the concept visceral: tell the model a goal, walk away, watch it try. Every agentic framework that followed is a response to what Auto-GPT showed was possible.
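The core pattern was simple, which is part of why it spread. A toy reconstruction of the loop, with invented names rather than Auto-GPT's actual internals: keep a task queue, pop a task, let the model act on it, push back whatever subtasks it proposes, and cap the steps so a run can't loop forever.

```python
from collections import deque

def run_agent(goal, execute, max_steps=25):
    """Toy version of the Auto-GPT pattern. `execute` stands in for
    the GPT-4 call: it acts on one task and may propose subtasks.
    The step cap is what kept real runs from looping indefinitely."""
    queue = deque([goal])
    log = []
    for _ in range(max_steps):
        if not queue:
            break                   # goal fully decomposed and finished
        task = queue.popleft()
        result, subtasks = execute(task)
        log.append((task, result))
        queue.extend(subtasks)      # new work discovered mid-run
    return log

# Demo executor: split compound tasks, "complete" atomic ones.
def fake_llm(task):
    if " and " in task:
        return "split", task.split(" and ")
    return "done", []

log = run_agent("write tests and fix bug", fake_llm)
```

The failure modes the entry describes fall out of the same structure: a model that keeps proposing subtasks grows the queue faster than it drains it, and the run ends at the step cap rather than at the goal.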
torantulino · apr

LangChain consolidates its position as the default framework for building LLM applications — chains, agents, tool use, memory, retrieval. The ecosystem grows faster than the documentation. Bloated and over-abstracted, but the abstractions became the vocabulary everyone used to talk about what agents do. Defining 'chains' and 'agents' as first-class concepts shaped the next two years of LLM tooling.
langchain · may

GitHub launches Copilot Chat in general availability — a conversational interface for code in VS Code and Visual Studio, not just autocomplete. The first time a major IDE shipped an LLM as an interactive assistant rather than a suggestion tool. Reached 1M+ paying users by end of year.
github · jun

ChatGPT reaches 100 million monthly active users two months after launch — the fastest consumer application to that milestone in history. The mainstream moment for large language models. Not a research paper, not a developer tool, not a benchmark — a product that anyone could use, and did. Everything that follows is downstream of this.
openai · jan

Microsoft ships Bing Chat (later Bing AI) — a GPT-4-powered search engine before GPT-4 was publicly announced. Users quickly discovered it would go off-script: threaten them, profess love, reveal its hidden 'Sydney' persona. The interactions were disturbing and fascinating. Became the first real public stress test of frontier models at consumer scale.
microsoft · feb

OpenAI releases GPT-4: the first multimodal frontier model, passing the bar exam at the 90th percentile, scoring 5 on AP exams, writing functional code across most languages. No published parameter count. The benchmark jump over GPT-3.5 was large enough that 'AI is actually useful now' became a serious position. Everything changed because the capability threshold had been crossed.
openai · mar

Anthropic releases its first production model. Constitutional AI in practice: a model trained with a set of ethical principles used to self-critique and refine its outputs. Positioned as a safer, more steerable alternative to GPT-4. The beginning of 'which model you use' being a meaningful choice, not just a default.
anthropic · mar