Roguelite Labs Wiki

A working knowledge base for AI tools, models, and agentic workflows.

Companion to roguelitelabs.xyz — browse /stack and /log on the main site.

Models

Capability profiles, benchmarks, pricing (USD per million input/output tokens), and honest assessments.

  • claude-opus-4-8 — 82.3% OSWorld-Verified, adaptive thinking, dynamic parallel subagents, current Anthropic flagship
  • claude-sonnet-4-6 — 79.6% SWE-bench, 72.7% OSWorld, current agentic default
  • claude-opus-4-6 — 128K output tokens, orchestrator role, legacy
  • gpt-5-5 — 82.7% Terminal-Bench 2.0, agentic-first design, dominant ecosystem
  • gpt-5-4 — three-tier system (instant/thinking/pro)
  • gemini-3-5-flash — 76.2% Terminal-Bench 2.1, 4× faster inference, $1.50/$9 pricing
  • gemini-2-5-pro — best long-context, leads coding benchmarks, free via AI Studio
  • gemma-4 — 31B Apache 2.0, 89.2% AIME 2026, #3 open model
  • deepseek-r1 — 671B MIT license, 97.3% MATH-500, best open math reasoning
  • grok-4 — best current-events accuracy, deep X data integration
  • kimi-k2 — 1T MoE from Moonshot AI, MIT license, leading open-weights Intelligence Index score
  • claude-3-5-sonnet — 49.0% SWE-bench Verified (Oct update), first computer use in public beta, $3/$15 pricing
  • gpt-4o — natively omnimodal (text/audio/image/video), ~320ms voice latency, 128K context, $2.50/$10 pricing
  • o1 — chain-of-thought reasoning model, 97% MATH-500, 77.3% GPQA Diamond, first model to surpass PhD-level experts on that benchmark
  • llama-4 — Scout (17B active / 109B total MoE, 10M context) and Maverick (17B active / 400B total MoE), natively multimodal, April 2025
  • claude-haiku-4-5 — $1/$5 pricing, 73.3% SWE-bench, 2× faster than Sonnet 4.5, computer use + extended thinking
  • gemini-2-5-ultra — Gemini 2.5 Pro + Deep Think reasoning mode; Ultra = subscription tier, not a separate model
  • llama-3-1 — 8B/70B/405B open-weights, 128K context, commercial license, first open frontier-class model
  • claude-3-7-sonnet — first hybrid reasoning model, introduced extended thinking, superseded by 4.x
  • nvidia-nemotron-3-ultra — 550B MoE, 48 on the Artificial Analysis Intelligence Index, #1 US open-weights, 1M context, 300+ tok/s
  • microsoft-mai-thinking-1 — ~1T MoE, 97.0% AIME 2025, commercially licensed training, Maia 200 inference
  • minimax-m3 — MSA architecture, 1M context, 59.0% SWE-Bench Pro, 9×/15× prefill/decode speedup vs M2

Tools

Agents, editors, and runtimes that make up a working AI stack.

  • claude-code — agentic CLI, hook system, context compaction
  • codex-cli — OpenAI terminal agent, sandboxed execution, open-source
  • cursor — AI-native VS Code fork, $100M ARR, large codebase navigation
  • amp-code — Sourcegraph agent, think-out-loud mode, multi-model
  • ollama — local inference CLI, pipeline-friendly, OpenAI-compatible API
  • lm-studio — GUI local model runner, good Apple Silicon support
  • exe-dev — remote VMs for overnight Claude Code sessions
  • perplexity-comet — AI-native browser, agent executes web tasks autonomously, first consumer computer-use product
  • windsurf — VS Code fork by Codeium, Cascade flow-based agent, acquired by Cognition July 2025
  • cline — open-source VS Code agent extension, model-agnostic, Plan and Act mode, no subscription
  • github-copilot — Microsoft/GitHub assistant, Workspace browser agent, Project Polaris model, 20M+ users

Concepts

The underlying ideas, protocols, and benchmarks worth understanding.

  • extended-thinking — chain-of-thought interleaved with tool use, first in Claude 3.7
  • reasoning-models — thinking tokens, o1/o3/R1, trade-offs vs fast models, benchmarks
  • tool-use — structured function calling, client vs server tools, agentic loops
  • prompt-caching — cache static prefixes, 90% cost reduction on cache reads
  • context-window — token limits, 1M context, lost-in-the-middle problem
  • computer-use — model sees screen, moves cursor, clicks; 72.7% OSWorld (claude-sonnet-4-6)
  • mcp — Model Context Protocol, standard for tool/context integration
  • vibe-coding — Karpathy Feb 2025, iterate on behavior not implementation
  • evals — SWE-bench, OSWorld, AIME, LiveCodeBench explained
  • rag — retrieval augmented generation, grounding in external documents
  • agentic-workflows — multi-step autonomous agent loops, patterns and failure modes
  • language-of-thought — Fodor's Mentalese hypothesis; systematicity, compositionality, and what reasoning requires
  • lot-llm-paradox — three positions on whether LLMs reason; the system-level synthesis
  • linguistic-relativity — Sapir-Whorf, Boroditsky, Pica on number words; language as cognitive infrastructure
  • future-time-reference — Chen (2013) on FTR grammar and savings behavior; the AI collaboration implication
  • extended-cognition — EC vs Extended Mind; why the distinction matters; literacy as coupling mechanism
  • distributed-cognition — Hutchins; cognitive processes as system-level properties; ship navigation and cockpit studies
  • tools-for-thought — Bush → Engelbart → Kay → Victor → Nielsen/Matuschak; augmentation vs automation
  • hci-ai — Norman's gulfs, Suchman's situated action, Endsley's situation awareness; HCI in the AI paradigm
  • assistive-technology — the AT evolution arc, curb cut effect, AAC; scaffold vs substitute
  • ephemeral-software — on-demand generated tools; specification as the new development bottleneck
  • personalized-systems — Licklider's symbiosis vision, PKM, adaptive cognitive scaffolding
  • universal-design-cognition — six derived principles for cognitive tool design; AI as cognitive policy instrument

Workflows

How the stack fits together in practice.

  • multi-agent-setup — orchestrator + subagents, parallel execution, cost patterns
  • overnight-runs — unattended sessions on exe.dev, hook notifications, task design

Music

Artists, sounds, and the aesthetic logic behind them.

  • don-toliver — psychedelic trap · dark luxury · autotune as instrument · Cactus Jack
  • julia-wolf — alt-pop · cinematic · emotionally vulnerable · feminine angst
  • dom-dolla — tech house · groove-first · club records · Melbourne

Blockchain Infrastructure

Ethereum scaling, rollup architecture, and the protocols beneath the L2 ecosystem.

  • op-stack — Optimism's modular L2 framework; powers Base, Zora, Unichain, and the Superchain
  • optimism — OP Mainnet, OP token, bicameral governance, RetroPGF
  • rollups — optimistic vs ZK, fraud proofs vs validity proofs, L2BEAT Stages framework
  • ethereum-l2s — comparison of Arbitrum, Base, OP Mainnet, ZKsync, and Starknet by TVL and approach
  • superchain — shared sequencer vision, cross-chain interop, Superchain fee flywheel
  • data-availability — calldata, EIP-4844 blobs, Celestia, EigenDA — where rollup data lives
  • eigenlayer — restaking, AVSs, $18B+ in shared economic security, verifiable cloud direction
  • base — Coinbase's L2, onchain economy thesis, 89% of Superchain revenue in 2025