Ollama
Ollama
Local model inference runtime. CLI-first design, single command to pull and run any supported open-weight model. The default choice for scripting, automation pipelines, and local development when you don't want API costs or network dependency.
Core Usage
ollama run llama3 # pull and run interactively
ollama run deepseek-r1:7b # specific variant
ollama pull gemma4:27b # pre-download without running
ollama list # show local models
API runs locally at http://localhost:11434. OpenAI-compatible endpoint at http://localhost:11434/v1/chat/completions — most tooling that targets OpenAI's API works with Ollama by changing the base URL.
Why CLI-First Matters
Ollama is designed to be scripted. Unlike lm-studio (which is GUI-oriented), Ollama integrates cleanly into shell scripts, cron jobs, and pipelines:
echo "summarize this" | ollama run mistral
ollama run codellama --stdin < my_file.py
This makes it practical for overnight-runs workflows, batch processing, and integration with agentic-workflows that need local inference.
Model Library
Supports most major open-weight models: Llama 3 family, Mistral, Gemma 4, deepseek-r1 distills, CodeLlama, Phi, Qwen, and more. Browse models at https://ollama.com/search. Models download from Ollama's registry (~1-30GB each depending on size).
Quantization levels available (Q4, Q5, Q8, F16) — smaller quantizations run faster with less VRAM at some quality cost.
Hardware Requirements
Rule of thumb: model parameter count × 2 bytes per parameter for F16, half that for Q8, quarter for Q4. A 7B model at Q4 needs ~4GB VRAM. A 31B (gemma-4 range) at Q4 needs ~16-20GB VRAM.
CPU inference works but is 10-50x slower than GPU. Practical for offline use or testing; not for production latency.
Installation
Linux/macOS:
curl -fsSL https://ollama.com/install.sh | sh
macOS also has a DMG installer (requires macOS 14 Sonoma or later).
Strengths
- Zero API cost after hardware
- Complete data privacy — nothing leaves local machine
- OpenAI-compatible API makes integration trivial
- Excellent for scripting and pipeline integration
- Fast model switching —
ollama run model2works immediately
Weaknesses
- Capped by local hardware — can't access frontier closed models
- Setup and maintenance (drivers, VRAM management) has overhead
- Quantized models lose some quality vs. full precision
- No built-in server auth — treat as internal service, don't expose publicly
Related
lm-studio · gemma-4 · deepseek-r1 · agentic-workflows · overnight-runs