Own Your AI — Local Models with Ollama + OpenWork — cheat sheet
The one idea: You don't need anyone's cloud. Run a capable open-weight model on your own machine (Ollama), point an agent at your own files (OpenWork), and keep your data yours.
Ship tonight: Ollama serving a local model + an OpenWork agent working over one of your own folders, plus one skill you share with a single link.
Quick-start (two installs, ~15 minutes)
1. Ollama — the engine
- Install from
ollama.com/download(macOS / Windows / Linux). On macOS/Linux you can also runcurl -fsSL https://ollama.com/install.sh | sh. - Pull + run a model:
ollama pull qwen3:8bthenollama run qwen3:8bto chat in the terminal. - It serves an OpenAI-compatible API at
http://localhost:11434/v1— that's the endpoint other apps point at.ollama serveruns it; the desktop app starts it for you.
2. OpenWork — the cockpit
- Download the free, open-source desktop app from
openworklabs.com(MIT-licensed; built by Different AI, YC-backed). No terminal needed. - First run: connect a model provider, give it access to a folder, send your first message.
- To use your local Ollama model, add a provider in your workspace config (
.config/opencode/opencode.json):
``json { "provider": { "ollama": { "npm": "@ai-sdk/openai-compatible", "name": "Ollama", "options": { "baseURL": "http://localhost:11434/v1" }, "models": { "qwen3:8b": { "name": "Qwen3 8B" } } } } } ``
- Then select Ollama → Qwen3 8B as the model. 100% local — nothing leaves your machine.
Models to try (current, mid-2026)
| Model | Sizes | Rough RAM (Q4) | Good for | | --- | --- | --- | --- | | Qwen3.5 | 4B–480B | 9B ≈ 8 GB | Strong all-rounder; great default for agents | | Granite 4.1 (IBM) | 3B / 8B / 30B | 8B ≈ 8 GB | Efficient dense workhorse, solid tool use | | Gemma 4 (Google) | E2B–31B | E4B ≈ 10 GB | On-device, audio-capable, 140+ languages | | GLM-4.7-Flash (Z.ai) | 30B-A3B | ≈ 18 GB | Coding- and agent-first, huge context | | gpt-oss (OpenAI) | 20B / 120B | 20B ≈ 16 GB+ | Reasoning + agentic (needs a beefier box) | | DeepSeek-R1 | 1.5B–70B distill | 8B ≈ 8 GB | Step-by-step reasoning | | Phi-4 (Microsoft) | 14B | ≈ 12 GB | Compact, capable, good per-GB | | Mistral 3 | 3B / 8B / 24B | 8B ≈ 7 GB | Lean Apache-2.0 workhorse |
Why this generation is interesting: Granite 4.0 mixed in Mamba-2 state-space layers to cut long-context memory ~70% (4.1 reverted to pure dense — hybrid isn't a free win yet); low-active-parameter MoE (GLM, Qwen, DeepSeek) gives big capacity at modest compute; Gemma 4 brings multimodal — including audio — fully on-device. The frontier open models (GLM-5.2, DeepSeek-V4, Qwen3-Coder-480B) are now too big for a laptop — which is exactly what Ollama Cloud is for.
Rule of thumb at Q4 quantization: 8 GB → 7–8B, 16 GB → 13–14B, 24 GB+ → 20–32B.
Local vs cloud — quick decision
- Go local when: the data is sensitive, you want $0 per-token cost, you need offline, or you're doing high-volume repetitive work.
- Reach for cloud when: you need frontier quality, a very large context window, or your laptop can't fit the model. Hybrid is fine — local by default, frontier key for the hard parts.
Run a big model in the cloud — same CLI
When a model is too big for your laptop, Ollama Cloud runs it on a datacenter GPU through the exact same commands:
``bash ollama signin # free tier works; Pro is $20/mo ollama run glm-5.2:cloud # a 744B model, no local RAM needed ``
Other current cloud tags: qwen3-coder:480b-cloud, gpt-oss:120b-cloud, deepseek-v3.1:671b-cloud. Cloud tags rotate fast — check ollama.com/search?c=cloud before you rely on one. Ollama states it doesn't retain your data.
Hit Ollama from code
Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 — point any OpenAI SDK at it with a dummy key:
```python from openai import OpenAI
client = OpenAI(baseurl="http://localhost:11434/v1/", apikey="ollama") # key is ignored r = client.chat.completions.create( model="qwen3:8b", messages=[{"role": "user", "content": "Say this is a test"}], ) print(r.choices[0].message.content) ```
Or curl the native API: curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"Why is the sky blue?","stream":false}'. First-party libraries: pip install ollama / npm install ollama.
Drive Claude Code on a local model
```bash ollama launch claude # easiest — built-in launcher
…or point Claude Code at Ollama manually:
export ANTHROPICBASEURL=http://localhost:11434 export ANTHROPICAUTHTOKEN=ollama export ANTHROPICAPIKEY="" claude --model qwen3.5 ```
Claude Code needs a large context window — 64k+ — so use a roomy model (qwen3.5, glm-4.7-flash, or a :cloud tag).
AI in the browser (WebAI)
Run models in a tab via WebGPU — no install, no server, nothing leaves the page:
- Transformers.js (Hugging Face) — any HF model in-browser: Whisper transcription, embeddings / semantic search, background removal. Set
device: 'webgpu'. - MediaPipe LLM Inference (Google) — on-device Gemma chat via a high-level task API (now maintenance-mode; successor is LiteRT-LM).
- ONNX Runtime Web (Microsoft) — production in-browser inference; also the engine under Transformers.js.
- Bonus: WebLLM (full chat LLMs, OpenAI-compatible) and Chrome's built-in Prompt API / Gemini Nano.
Fine-tune your own (teaser)
- LoRA / QLoRA — train a small adapter, not the whole model; a small model fine-tunes on one consumer GPU in under an hour.
- Beginner-friendly tools: Unsloth (fastest), Axolotl (YAML config), MLX-LM (Apple Silicon); Hugging Face TRL for more control.
- Then run it in Ollama: merge the adapter → export to GGUF → write a one-line Modelfile (
FROM model.gguf, orADAPTER ./adapter) →ollama create my-model. Use the same base you trained against.
Top gotchas + fixes
- Out of memory / crawling slow → the model is too big for your RAM. Drop to a smaller model or a more-quantized tag (e.g.
:8b→ smaller, or aq4variant). - Quality feels off vs ChatGPT → that's quantization + a smaller model, not a bug. Step up a size if RAM allows, or use hybrid for the hard asks.
- OpenWork can't see the model → make sure
ollama serveis running and the model is pulled; baseURL must behttp://localhost:11434/v1(note the/v1). - First answer is slow → the model loads into RAM on first call, then warms up. Subsequent calls are faster.
- Hybrid key handling → cloud keys live in OpenWork's provider settings, not in your shared link. Don't paste keys into skills you share.
Glossary
- Open-weight model — the trained weights are published, so anyone can download and run it. Free to run; you own the copy.
- Quantization (Q4, etc.) — shrinking the weights to fit in less RAM. Smaller = faster + lower RAM, slightly lower quality.
- Ollama — a one-click runner that downloads open-weight models and serves them on
localhost:11434. - OpenWork — an open-source desktop agent that drives a model (local or cloud) over your own files, browser, and tools.
- Skill — a reusable instruction/setup in OpenWork you can package and share as a single link.
- MCP server — a standard way to give an agent extra tools (files, search, APIs) it can call.
Template library
Copy a starting point, paste it into your assistant's instructions, then make it yours.
Point a fully-local agent at a folder and let it find, summarize, and organize.
You are my local file assistant. You work entirely on my machine — nothing leaves it. Scope: only read and act on files inside the folder I've given you access to. Voice: plain and direct. No filler. What you do: when I ask, find the right files, summarize them, pull out the facts I need, and draft notes or replies grounded in what's actually in the folder. Rules: never invent file contents or facts. If something isn't in the folder, say so. Quote the filename you used. Ask first: before renaming, moving, or overwriting anything, show me the plan and wait for a yes. When I give you a task, start by listing which files you'll look at.
A reusable OpenWork skill you can package and hand to a teammate as one link.
Skill: Weekly Digest. Purpose: read everything added to this folder in the last 7 days and write a one-page digest. Output: a short summary up top (3–5 bullets), then a per-file list — filename, one line on what changed or what it says, and any action it implies. Rules: work only from the files present. Flag anything you couldn't open. Don't include files older than 7 days. Privacy: this skill is meant to run on a local model. It contains no API keys and no private data — only instructions. Run this whenever I say "weekly digest" and tell me which folder you used.