Cloud vs local LLM — the honest tradeoff

Status: Candidate, awaiting founder verification.

Why this page exists: New users ask "should I use Claude or Ollama?" The answer depends on what you're optimizing for.

TL;DR

Cloud LLMs are smarter but cost money, send your data to a provider, and require a network. Local LLMs are private and cost nothing per token (you still pay for hardware and electricity), but they are slower and dumber. Apiary supports both because the right answer changes depending on what you're doing right now.

The five axes

Axis             Cloud (Anthropic/OpenAI/etc.)   Local (Ollama/LM Studio)
─────────────────────────────────────────────────────────────────────────
Quality          High to state-of-the-art        Mid (7B–70B class)
Latency          Network-bound (~200-2000ms)     Local-bound (depends on GPU)
Cost per token   Real money                      Electricity only
Privacy          Provider sees your data         Stays on your machine
Availability     Requires internet               Works offline

Neither side wins on every axis. Cloud leads on quality; local leads on cost, privacy, and availability; latency depends on your network and your hardware. The right pick is task-by-task.

When cloud is the right call

You're writing production code. Quality matters more than cost.
You need long context windows. Cloud models still beat local at >32K tokens, partly because the memory needed to hold a long context on local hardware grows with its length.
You're prototyping. No need to download a 30GB model file just to test something.
You want the model's full capability surface. Tool use, vision, structured outputs — cloud has the polished version.

When local is the right call

You're processing sensitive data. Customer PII, source code you can't share, internal documents.
You're running thousands of small queries. A local model with no per-token cost wins on volume.
You're working offline. Plane, basement, network is down.
You're learning. Watching a local model think (token by token) teaches you more about how LLMs work than the cloud's polished output.
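
The volume argument above can be checked with rough arithmetic. The sketch below compares a metered cloud bill with the electricity cost of running the same batch locally. Every number in it (price per million tokens, tokens per second, power draw, electricity tariff) is a hypothetical placeholder, not a quoted rate, and it ignores the purchase price of the hardware and treats prompt tokens as if they were processed at the generation rate.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    queries: int            # number of requests in the batch
    tokens_per_query: int   # prompt + completion tokens, averaged


def cloud_cost_usd(w: Workload, usd_per_million_tokens: float) -> float:
    """Metered cost: total tokens times the per-token price."""
    total_tokens = w.queries * w.tokens_per_query
    return total_tokens / 1_000_000 * usd_per_million_tokens


def local_hours(w: Workload, tokens_per_second: float) -> float:
    """Wall-clock time for one machine to work through the whole batch."""
    total_tokens = w.queries * w.tokens_per_query
    return total_tokens / tokens_per_second / 3600


def local_cost_usd(w: Workload, tokens_per_second: float,
                   watts: float, usd_per_kwh: float) -> float:
    """Electricity only: run time times power draw times tariff."""
    kwh = local_hours(w, tokens_per_second) * (watts / 1000)
    return kwh * usd_per_kwh


# Hypothetical numbers for illustration only.
batch = Workload(queries=10_000, tokens_per_query=500)
print(f"cloud: ${cloud_cost_usd(batch, usd_per_million_tokens=3.0):.2f}")
print(f"local: ${local_cost_usd(batch, 50, watts=300, usd_per_kwh=0.15):.2f}")
print(f"local run time: {local_hours(batch, 50):.1f} h")
```

With these placeholder inputs the local run is much cheaper but takes more than a day on one machine, so cost and turnaround time have to be weighed together.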

The Apiary stance

Apiary doesn't pick for you. The BYO-LLM gate lets you switch providers in two clicks. Configure Ollama for sensitive work, switch to Anthropic for the hard problems, switch to OpenRouter when you want one key to access everything.

The substrate doesn't care which brain you bring. The committee structure, the cortex routing, the audit log — none of it depends on the provider. The substrate IS the architecture; the LLM is just the inference engine.
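
One way to picture "the substrate doesn't care which brain you bring" is a single interface that every provider satisfies, with the audit log living outside it. The sketch below is illustrative only: `Provider`, `StubProvider`, and `Substrate` are hypothetical names, not Apiary's actual classes, and the stub returns a canned string instead of calling any real API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Provider(Protocol):
    """Anything with a name and a complete() method can serve as the 'brain'."""
    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class StubProvider:
    """Stand-in for a real backend; returns a canned reply, makes no network call."""
    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] reply to: {prompt}"


@dataclass
class Substrate:
    """Hypothetical wrapper: logging works the same for every provider."""
    provider: Provider
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def ask(self, prompt: str) -> str:
        reply = self.provider.complete(prompt)
        self.audit_log.append((self.provider.name, prompt))
        return reply


substrate = Substrate(provider=StubProvider("ollama"))
substrate.ask("summarize this internal memo")
substrate.provider = StubProvider("anthropic")  # switching is one assignment
substrate.ask("design a migration plan")
print(substrate.audit_log)
```

Because the log records the provider name alongside each prompt, the audit trail stays continuous across a switch.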

A practical hybrid

A pattern that works well:

HYBRID DEPLOYMENT

  Day-to-day routing       Local Ollama (Llama 3.1 8B)
                           Free, private, fast enough for routing decisions

  Hard problems            Cloud Anthropic (Claude Opus / Sonnet)
                           When the local model says "this is beyond me"

  Bulk classification      Cloud Groq (free tier Llama)
                           When you need 1000 calls and don't want to wait

Apiary's thalamus dispatcher can route between providers based on the task. The substrate handles the difference — you don't.
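
The routing rules in the hybrid pattern can be sketched as a small decision function. This is an assumption about how such a dispatcher might behave, not the thalamus dispatcher's actual logic: the `Task` fields, the task labels, and the provider names returned are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str                # "routing", "hard", or "bulk" (labels are assumptions)
    sensitive: bool = False  # PII or code that must stay on the machine
    online: bool = True      # whether a network connection is available


def pick_provider(task: Task) -> str:
    # Privacy and availability are hard constraints, so check them first.
    if task.sensitive or not task.online:
        return "ollama"
    if task.kind == "hard":
        return "anthropic"
    if task.kind == "bulk":
        return "groq"
    return "ollama"  # day-to-day default


def escalate(current: str, task: Task) -> str:
    """Move a task the local model declined to the cloud, if constraints allow."""
    if current == "ollama" and task.online and not task.sensitive:
        return "anthropic"
    return current


print(pick_provider(Task("bulk")))
print(pick_provider(Task("hard", sensitive=True)))
print(escalate("ollama", Task("routing")))
```

Checking the hard constraints first means a sensitive task never leaves the machine, even when it is also a hard one.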

On model size and quality

Bigger isn't always better, but for general intelligence tasks, parameter count correlates with capability:

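7B-8B class (Llama 3.1 8B, Phi-3 small). Good enough for routing, summarization, structured extraction. Fits on a laptop with 16GB RAM when quantized.
70B class (Llama 3.1 70B; Mixtral 8x22B is a larger mixture-of-experts model in a similar tier). Approaches GPT-3.5 quality. Needs serious GPU memory: the 4-bit quantized weights alone are about 35GB, so a single 24GB card only works with part of the model offloaded to CPU.
Frontier class (Claude Opus, GPT-4, Gemini Ultra). State of the art. Their parameter counts are not published, but they are served only via cloud and are assumed to be far too big for any consumer hardware.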

You can run anything 70B and under locally if you have the hardware. Above that, cloud is currently the only practical option on consumer hardware.
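
A quick way to judge "if you have the hardware" is to estimate the memory the weights alone need: parameter count times bytes per parameter. The sketch below does that arithmetic. It is a lower bound, since it leaves out the context cache and runtime overhead, and the function name is just an illustrative helper.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory for the weights alone: parameters x bytes per parameter."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal gigabytes


for params in (8, 70):
    for bits in (16, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{params}B @ {bits}-bit: ~{gb:.0f} GB")
# 8B  @ 16-bit: ~16 GB,  8B  @ 4-bit: ~4 GB
# 70B @ 16-bit: ~140 GB, 70B @ 4-bit: ~35 GB
```

The 4-bit rows are the ones that matter for most local setups, since tools like Ollama typically distribute quantized model files.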

Source quotes

"Cheapest → Ollama or WebLLM (free). Easiest → OpenRouter ($1 + one key = access to everything). Best → Direct Anthropic (Claude Opus). Fastest → Groq (Llama). Most private → Local (Ollama / WebLLM)."