Roguelite Labs Wiki
A working knowledge base for AI tools, models, and agentic workflows.
Companion to roguelitelabs.xyz — browse /stack and /log on the main site.
Models
Capability profiles, benchmarks, pricing, and honest assessments.
- claude-opus-4-8 — 82.3% OSWorld-Verified, adaptive thinking, dynamic parallel subagents, current Anthropic flagship
- claude-sonnet-4-6 — 79.6% SWE-bench, 72.7% OSWorld, current agentic default
- claude-opus-4-6 — 128K output tokens, orchestrator role, legacy
- gpt-5-5 — 82.7% Terminal-Bench 2.0, agentic-first design, dominant ecosystem
- gpt-5-4 — three-tier system (instant/thinking/pro)
- gemini-3-5-flash — 76.2% Terminal-Bench 2.1, 4× faster inference, $1.50/$9 pricing
- gemini-2-5-pro — best long-context, leads coding benchmarks, free via AI Studio
- gemma-4 — 31B Apache 2.0, 89.2% AIME 2026, #3 open model
- deepseek-r1 — 671B MIT license, 97.3% MATH-500, best open math reasoning
- grok-4 — best current-events accuracy, deep X data integration
- kimi-k2 — 1T MoE from Moonshot AI, MIT license, leading open-weights Intelligence Index score
- claude-3-5-sonnet — 49.0% SWE-bench Verified (Oct update), first computer use in public beta, $3/$15 pricing
- gpt-4o — natively omnimodal (text/audio/image/video), ~320ms voice latency, 128K context, $2.50/$10 pricing
- o1 — chain-of-thought reasoning model, 97% MATH-500, 77.3% GPQA Diamond, first to beat PhD experts
- llama-4 — Scout (17B/109B MoE, 10M context) and Maverick (17B/400B MoE), natively multimodal, April 2025
- claude-haiku-4-5 — $1/$5 pricing, 73.3% SWE-bench, 2× faster than Sonnet 4.5, computer use + extended thinking
- gemini-2-5-ultra — Gemini 2.5 Pro + Deep Think reasoning mode; Ultra = subscription tier, not a separate model
- llama-3-1 — 8B/70B/405B open-weights, 128K context, commercial license, first open frontier-class model
- claude-3-7-sonnet — first hybrid reasoning model, introduced extended thinking, superseded by 4.x
- nvidia-nemotron-3-ultra — 550B MoE, 48 AA Intelligence Index, #1 US open-weights, 1M context, 300+ tok/s
- microsoft-mai-thinking-1 — ~1T MoE, 97.0% AIME 2025, commercially licensed training, Maia 200 inference
- minimax-m3 — MSA architecture, 1M context, 59.0% SWE-Bench Pro, 9×/15× prefill/decode speedup vs M2
Tools
Agents, editors, and runtimes that make up a working AI stack.
- claude-code — agentic CLI, hook system, context compaction
- codex-cli — OpenAI terminal agent, sandboxed execution, open-source
- cursor — AI-native VS Code fork, $100M ARR, large codebase navigation
- amp-code — Sourcegraph agent, think-out-loud mode, multi-model
- ollama — local inference CLI, pipeline-friendly, OpenAI-compatible API
- lm-studio — GUI local model runner, good Apple Silicon support
- exe-dev — remote VMs for overnight Claude Code sessions
- perplexity-comet — AI-native browser, agent executes web tasks autonomously, first consumer computer-use product
- windsurf — VS Code fork by Codeium, Cascade flow-based agent, acquired by Cognition July 2025
- cline — open-source VS Code agent extension, model-agnostic, Plan and Act mode, no subscription
- github-copilot — Microsoft/GitHub assistant, Workspace browser agent, Project Polaris model, 20M+ users
Concepts
The underlying ideas, protocols, and benchmarks worth understanding.
- extended-thinking — chain-of-thought interleaved with tool use, first in Claude 3.7
- reasoning-models — thinking tokens, o1/o3/R1, trade-offs vs fast models, benchmarks
- tool-use — structured function calling, client vs server tools, agentic loops
- prompt-caching — cache static prefixes, 90% cost reduction on cache reads
- context-window — token limits, 1M context, lost-in-the-middle problem
- computer-use — model sees screen, moves cursor, clicks; claude-sonnet-4-6 scores 72.7% on OSWorld
- mcp — Model Context Protocol, standard for tool/context integration
- vibe-coding — Karpathy Feb 2025, iterate on behavior not implementation
- evals — SWE-bench, OSWorld, AIME, LiveCodeBench explained
- rag — retrieval augmented generation, grounding in external documents
- agentic-workflows — multi-step autonomous agent loops, patterns and failure modes
- language-of-thought — Fodor's Mentalese hypothesis; systematicity, compositionality, and what reasoning requires
- lot-llm-paradox — three positions on whether LLMs reason; the system-level synthesis
- linguistic-relativity — Sapir-Whorf, Boroditsky, Pica on number words; language as cognitive infrastructure
- future-time-reference — Chen (2013) on FTR grammar and savings behavior; the AI collaboration implication
- extended-cognition — EC vs Extended Mind; why the distinction matters; literacy as coupling mechanism
- distributed-cognition — Hutchins; cognitive processes as system-level properties; ship navigation and cockpit studies
- tools-for-thought — Bush → Engelbart → Kay → Victor → Nielsen/Matuschak; augmentation vs automation
- hci-ai — Norman's gulfs, Suchman's situated action, Endsley's situation awareness; HCI in the AI paradigm
- assistive-technology — the AT evolution arc, curb cut effect, AAC; scaffold vs substitute
- ephemeral-software — on-demand generated tools; specification as the new development bottleneck
- personalized-systems — Licklider's symbiosis vision, PKM, adaptive cognitive scaffolding
- universal-design-cognition — six derived principles for cognitive tool design; AI as cognitive policy instrument
Workflows
How the stack fits together in practice.
- multi-agent-setup — orchestrator + subagents, parallel execution, cost patterns
- overnight-runs — unattended sessions on exe.dev, hook notifications, task design
Music
Artists, sounds, and the aesthetic logic behind them.
- don-toliver — psychedelic trap · dark luxury · autotune as instrument · Cactus Jack
- julia-wolf — alt-pop · cinematic · emotionally vulnerable · feminine angst
- dom-dolla — tech house · groove-first · club records · Melbourne
Blockchain Infrastructure
Ethereum scaling, rollup architecture, and the protocols beneath the L2 ecosystem.
- op-stack — Optimism's modular L2 framework; powers Base, Zora, Unichain, and the Superchain
- optimism — OP Mainnet, OP token, bicameral governance, RetroPGF
- rollups — optimistic vs ZK, fraud proofs vs validity proofs, L2BEAT staging
- ethereum-l2s — comparison of Arbitrum, Base, OP Mainnet, zkSync, StarkNet by TVL and approach
- superchain — shared sequencer vision, cross-chain interop, Superchain fee flywheel
- data-availability — calldata, EIP-4844 blobs, Celestia, EigenDA — where rollup data lives
- eigenlayer — restaking, AVSs, $18B+ economic security sharing, verifiable cloud direction
- base — Coinbase's L2, onchain economy thesis, 89% of Superchain revenue in 2025