Ollama

Ollama

Local model inference runtime. CLI-first design, single command to pull and run any supported open-weight model. The default choice for scripting, automation pipelines, and local development when you don't want API costs or network dependency.

Core Usage

ollama run llama3          # pull and run interactively
ollama run deepseek-r1:7b  # specific variant
ollama pull gemma4:27b     # pre-download without running
ollama list                # show local models

API runs locally at http://localhost:11434. OpenAI-compatible endpoint at http://localhost:11434/v1/chat/completions — most tooling that targets OpenAI's API works with Ollama by changing the base URL.

Why CLI-First Matters

Ollama is designed to be scripted. Unlike lm-studio (which is GUI-oriented), Ollama integrates cleanly into shell scripts, cron jobs, and pipelines:

echo "summarize this" | ollama run mistral
ollama run codellama --stdin < my_file.py

This makes it practical for overnight-runs workflows, batch processing, and integration with agentic-workflows that need local inference.

Model Library

Supports most major open-weight models: Llama 3 family, Mistral, Gemma 4, deepseek-r1 distills, CodeLlama, Phi, Qwen, and more. Browse models at https://ollama.com/search. Models download from Ollama's registry (~1-30GB each depending on size).

Quantization levels available (Q4, Q5, Q8, F16) — smaller quantizations run faster with less VRAM at some quality cost.

Hardware Requirements

Rule of thumb: model parameter count × 2 bytes per parameter for F16, half that for Q8, quarter for Q4. A 7B model at Q4 needs ~4GB VRAM. A 31B (gemma-4 range) at Q4 needs ~16-20GB VRAM.

CPU inference works but is 10-50x slower than GPU. Practical for offline use or testing; not for production latency.

Installation

Linux/macOS:

curl -fsSL https://ollama.com/install.sh | sh

macOS also has a DMG installer (requires macOS 14 Sonoma or later).

Strengths

  • Zero API cost after hardware
  • Complete data privacy — nothing leaves local machine
  • OpenAI-compatible API makes integration trivial
  • Excellent for scripting and pipeline integration
  • Fast model switching — ollama run model2 works immediately

Weaknesses

  • Capped by local hardware — can't access frontier closed models
  • Setup and maintenance (drivers, VRAM management) has overhead
  • Quantized models lose some quality vs. full precision
  • No built-in server auth — treat as internal service, don't expose publicly

Related

lm-studio · gemma-4 · deepseek-r1 · agentic-workflows · overnight-runs

Sources