Cloud vs local LLM — the honest tradeoff
Status: Candidate, awaiting founder verification.
Why this page exists: New users ask "should I use Claude or Ollama?" The answer depends on what you're optimizing for.
TL;DR
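Cloud LLMs are generally more capable, but they cost money per token, send your data to a provider, and require a network connection. Local LLMs have no per-token cost and keep data on your machine, but they are usually slower and less capable, and you pay for the hardware and electricity. Apiary supports both because the right answer changes depending on what you're doing right now.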
The five axes
Axis             Cloud (Anthropic/OpenAI/etc.)     Local (Ollama/LM Studio)
─────────────────────────────────────────────────────────────────────────
Quality          High to state-of-the-art          Mid (7B–70B class)
Latency          Network-bound (~200-2000 ms)      Local-bound (depends on GPU)
Cost per token   Real money                        Electricity (on hardware you own)
Privacy          Provider sees your data           Stays on your machine
Availability     Requires internet                 Works offline

Neither side wins on every axis. The right pick is task-by-task.
When cloud is the right call
- You're writing production code. Quality matters more than cost.
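- You need long context windows. Some local models advertise 128K-token contexts, but past roughly 32K tokens the memory and speed cost on consumer hardware gets steep, so cloud is still the practical choice there.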
- You're prototyping. No need to download a 30GB model file just to test something.
- You want the model's full capability surface. Tool use, vision, structured outputs — cloud has the polished version.
When local is the right call
- You're processing sensitive data. Customer PII, source code you can't share, internal documents.
- You're running thousands of small queries. A local model with no per-token cost wins on volume.
- You're working offline. Plane, basement, network is down.
- You're learning. Watching a local model think (token by token) teaches you more about how LLMs work than the cloud's polished output.
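The volume argument in the list above comes down to simple arithmetic. Here is a minimal sketch of it. Every number you feed in (price per million tokens, seconds per call, wattage, electricity rate) is a placeholder you supply yourself, not a quote from any provider, and the local side counts electricity only, not the hardware purchase.

```python
def cloud_cost_usd(calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Total cloud spend for a batch that is billed per token."""
    total_tokens = calls * tokens_per_call
    return total_tokens * usd_per_million_tokens / 1_000_000


def local_cost_usd(calls: int, seconds_per_call: float, watts: float, usd_per_kwh: float) -> float:
    """Electricity cost of running the same batch on your own machine."""
    hours = calls * seconds_per_call / 3600
    kilowatts = watts / 1000
    return hours * kilowatts * usd_per_kwh
```

With illustrative inputs of 10,000 calls at 500 tokens each and $3 per million tokens, the cloud side comes to $15. The same batch at 2 seconds per call on a 300 W machine at $0.15/kWh comes to about $0.25 in electricity. Plug in your own numbers before drawing conclusions.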
The Apiary stance
Apiary doesn't pick for you. The BYO-LLM gate lets you switch providers in two clicks. Configure Ollama for sensitive work, switch to Anthropic for the hard problems, switch to OpenRouter when you want one key to access everything.
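This page doesn't show Apiary's configuration format, so the following is only a hypothetical sketch of what "switching providers" amounts to: a small registry, plus a guard that keeps sensitive work on a local provider. The `Provider` class, the `PROVIDERS` table, and `pick_provider` are invented for illustration. The base URLs are the providers' public defaults, and the environment variable names are common conventions rather than anything Apiary is known to require.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Provider:
    name: str
    base_url: str
    api_key_env: Optional[str]  # None means no API key is needed
    local: bool                 # True if prompts never leave the machine


# Hypothetical registry; Apiary's real configuration may look different.
PROVIDERS: Dict[str, Provider] = {
    "ollama": Provider("ollama", "http://localhost:11434", None, True),
    "anthropic": Provider("anthropic", "https://api.anthropic.com", "ANTHROPIC_API_KEY", False),
    "openrouter": Provider("openrouter", "https://openrouter.ai/api/v1", "OPENROUTER_API_KEY", False),
}


def pick_provider(name: str, sensitive: bool = False) -> Provider:
    """Return the named provider, refusing cloud providers for sensitive work."""
    provider = PROVIDERS[name]
    if sensitive and not provider.local:
        raise ValueError(f"{name} is a cloud provider; use a local one for sensitive data")
    return provider
```

The sketch only models selection. The request formats differ between providers, and handling that is the substrate's job.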
The substrate doesn't care which brain you bring. The committee structure, the cortex routing, the audit log — none of it depends on the provider. The substrate IS the architecture; the LLM is just the inference engine.
A practical hybrid
A pattern that works well:
HYBRID DEPLOYMENT
Day-to-day routing     Local Ollama (Llama 3.1 8B)
                       Free, private, fast enough for routing decisions
Hard problems          Cloud Anthropic (Claude Opus / Sonnet)
                       When the local model says "this is beyond me"
Bulk classification    Cloud Groq (free tier Llama)
                       When you need 1000 calls and don't want to wait

Apiary's thalamus dispatcher can route between providers based on the task. The substrate handles the difference, so you don't have to.
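The internals of the thalamus dispatcher are not described on this page, so the following is a hypothetical sketch of the pattern, not Apiary's implementation. The task type picks a backend, and a local answer that signals "this is beyond me" is retried on the cloud backend. The `ESCALATE` sentinel, the `ROUTES` table, and the backend names are all assumptions. Backends are plain callables so the sketch runs without any network access.

```python
from typing import Callable, Dict, Tuple

# Hypothetical sentinel the local model is prompted to emit when it cannot answer.
ESCALATE = "BEYOND_ME"

# Hypothetical mapping from task type to backend name.
ROUTES: Dict[str, str] = {
    "routing": "local",
    "hard": "cloud",
    "bulk": "bulk",
}


def dispatch(
    task_type: str,
    prompt: str,
    backends: Dict[str, Callable[[str], str]],
) -> Tuple[str, str]:
    """Return (backend_name, answer), escalating local -> cloud on the sentinel."""
    name = ROUTES.get(task_type, "local")  # unknown task types default to local
    answer = backends[name](prompt)
    if name == "local" and answer.strip() == ESCALATE:
        name = "cloud"
        answer = backends[name](prompt)
    return name, answer
```

Defaulting unknown task types to the local backend keeps data on the machine unless something explicitly routes it out. That is a design choice of this sketch, not a documented Apiary behavior.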
On model size and quality
Bigger isn't always better, but for general intelligence tasks, parameter count correlates with capability:
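- 8B parameters and under (Llama 3.1 8B, Phi-3 Mini at 3.8B). Good enough for routing, summarization, structured extraction. Fits on a laptop with 16GB RAM when quantized.
- 70B parameters (Llama 3.1 70B). Matches or exceeds GPT-3.5 on most public benchmarks. Needs serious hardware: roughly 35-40GB of memory at 4-bit quantization, so a single 24GB GPU works only with CPU offload or heavier quantization. Mixtral 8x22B is often grouped here, but it is a mixture-of-experts model with about 141B total parameters and needs more memory still.
- Frontier models (Claude Opus, GPT-4, Gemini Ultra). State of the art. Their parameter counts are not published, and they are only available via cloud. Open-weight models at this scale exist (Llama 3.1 405B) but are too big for typical consumer hardware.
You can run models up to the 70B class locally if you have the hardware. Above that, cloud is currently the practical option.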
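A rough way to check whether a model fits your hardware: weight memory is approximately parameter count times bits per weight, divided by 8. The sketch below applies that rule of thumb. The 20% overhead factor is an assumption standing in for the KV cache and runtime buffers, which in practice grow with context length.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(
    params_billions: float,
    bits_per_weight: int,
    memory_gb: float,
    overhead: float = 1.2,  # assumed 20% headroom for KV cache and buffers
) -> bool:
    """True if the weights plus the assumed overhead fit in the given memory."""
    return weight_memory_gb(params_billions, bits_per_weight) * overhead <= memory_gb
```

Under these assumptions an 8B model at 4-bit needs about 4GB for weights and fits easily in 16GB. A 70B model at 4-bit needs about 35GB for weights, which does not fit a single 24GB GPU without offloading part of the model to CPU memory.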
Source quotes
"Cheapest → Ollama or WebLLM (free). Easiest → OpenRouter ($1 + one key = access to everything). Best → Direct Anthropic (Claude Opus). Fastest → Groq (Llama). Most private → Local (Ollama / WebLLM)."