Advanced AI Build Night · Local Models · Cheat sheet

Own Your AI — Local Models with Ollama + OpenWork — cheat sheet

The one idea: You don't need anyone's cloud. Run a capable open-weight model on your own machine (Ollama), point an agent at your own files (OpenWork), and keep your data yours.

Ship tonight: Ollama serving a local model + an OpenWork agent working over one of your own folders, plus one skill you share with a single link.

Quick-start (two installs, ~15 minutes)

1. Ollama — the engine

Install from ollama.com/download (macOS / Windows / Linux). On macOS/Linux you can also run curl -fsSL https://ollama.com/install.sh | sh.
Pull + run a model: ollama pull qwen3:8b then ollama run qwen3:8b to chat in the terminal.
It serves an OpenAI-compatible API at http://localhost:11434/v1 — that's the endpoint other apps point at. ollama serve runs it; the desktop app starts it for you.

2. OpenWork — the cockpit

Download the free, open-source desktop app from openworklabs.com (MIT-licensed; built by Different AI, YC-backed). No terminal needed.
First run: connect a model provider, give it access to a folder, send your first message.
To use your local Ollama model, add a provider in your workspace config (.config/opencode/opencode.json):

```json
{
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama",
      "options": { "baseURL": "http://localhost:11434/v1" },
      "models": { "qwen3:8b": { "name": "Qwen3 8B" } }
    }
  }
}
```

Then select Ollama → Qwen3 8B as the model. 100% local — nothing leaves your machine.
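
If you pull more than one model, the provider block gains one entry per tag. Here is a minimal Python sketch that generates the same structure. The helper name `ollama_provider_config` is hypothetical and not part of OpenWork; the keys simply mirror the JSON in this cheat sheet.

```python
import json


def ollama_provider_config(models, base_url="http://localhost:11434/v1"):
    """Build a provider block from a dict of {ollama_tag: display_name}.

    Hypothetical helper: the key names mirror the JSON config in this cheat sheet.
    """
    return {
        "provider": {
            "ollama": {
                "npm": "@ai-sdk/openai-compatible",
                "name": "Ollama",
                "options": {"baseURL": base_url},
                "models": {tag: {"name": label} for tag, label in models.items()},
            }
        }
    }


config = ollama_provider_config({"qwen3:8b": "Qwen3 8B"})
print(json.dumps(config, indent=2))
```

Add more tags to the dict and paste the printed JSON into the config file. Each tag should match a model you have already pulled.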

Models to try (current, mid-2026)

| Model | Sizes | Rough RAM (Q4) | Good for |
| --- | --- | --- | --- |
| Qwen3.5 | 4B–480B | 9B ≈ 8 GB | Strong all-rounder; great default for agents |
| Granite 4.1 (IBM) | 3B / 8B / 30B | 8B ≈ 8 GB | Efficient dense workhorse, solid tool use |
| Gemma 4 (Google) | E2B–31B | E4B ≈ 10 GB | On-device, audio-capable, 140+ languages |
| GLM-4.7-Flash (Z.ai) | 30B-A3B | ≈ 18 GB | Coding- and agent-first, huge context |
| gpt-oss (OpenAI) | 20B / 120B | 20B ≈ 16 GB+ | Reasoning + agentic (needs a beefier box) |
| DeepSeek-R1 | 1.5B–70B distill | 8B ≈ 8 GB | Step-by-step reasoning |
| Phi-4 (Microsoft) | 14B | ≈ 12 GB | Compact, capable, good per-GB |
| Mistral 3 | 3B / 8B / 24B | 8B ≈ 7 GB | Lean Apache-2.0 workhorse |

Why this generation is interesting: Granite 4.0 mixed in Mamba-2 state-space layers to cut long-context memory ~70% (4.1 reverted to pure dense — hybrid isn't a free win yet); low-active-parameter MoE (GLM, Qwen, DeepSeek) gives big capacity at modest compute; Gemma 4 brings multimodal — including audio — fully on-device. The frontier open models (GLM-5.2, DeepSeek-V4, Qwen3-Coder-480B) are now too big for a laptop — which is exactly what Ollama Cloud is for.

Rule of thumb at Q4 quantization: 8 GB → 7–8B, 16 GB → 13–14B, 24 GB+ → 20–32B.
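
The arithmetic behind that rule can be sketched roughly as follows. At about 4 bits per weight, one billion parameters takes around 0.5 GB, so an 8B model's weights come to roughly 4 GB. The rule of thumb asks for about twice that because the OS, the runtime, and the context cache also need room. Both function names are hypothetical, the 4.0 bits figure is a simplification (real Q4 variants store slightly more), and the "under 7B" tier for machines below 8 GB is an assumption not stated above.

```python
def q4_weight_gb(params_billion, bits_per_weight=4.0):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def suggested_size(ram_gb):
    """Map installed RAM to the model-size tier from the rule of thumb."""
    if ram_gb >= 24:
        return "20-32B"
    if ram_gb >= 16:
        return "13-14B"
    if ram_gb >= 8:
        return "7-8B"
    return "under 7B"  # assumption: the cheat sheet gives no tier below 8 GB


print(q4_weight_gb(8))     # 4.0
print(suggested_size(16))  # 13-14B
```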

Local vs cloud — quick decision

Go local when: the data is sensitive, you want $0 per-token cost, you need offline, or you're doing high-volume repetitive work.
Reach for cloud when: you need frontier quality, a very large context window, or your laptop can't fit the model. Hybrid is fine — local by default, frontier key for the hard parts.

Run a big model in the cloud — same CLI

When a model is too big for your laptop, Ollama Cloud runs it on a datacenter GPU through the exact same commands:

```bash
ollama signin                # free tier works; Pro is $20/mo
ollama run glm-5.2:cloud     # a 744B model, no local RAM needed
```

Other current cloud tags: qwen3-coder:480b-cloud, gpt-oss:120b-cloud, deepseek-v3.1:671b-cloud. Cloud tags rotate fast — check ollama.com/search?c=cloud before you rely on one. Ollama states it doesn't retain your data.

Hit Ollama from code

Ollama serves an OpenAI-compatible API at http://localhost:11434/v1 — point any OpenAI SDK at it with a dummy key:

```python
from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server.
client = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")  # key is ignored
r = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Say this is a test"}],
)
print(r.choices[0].message.content)
```

Or curl the native API: curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"Why is the sky blue?","stream":false}'. First-party libraries: pip install ollama / npm install ollama.
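
For a dependency-free route, the curl call above can be mirrored with Python's standard library. This is a sketch rather than a reference client: `build_generate_request` and `generate` are hypothetical helper names, and the 120-second timeout is an arbitrary assumption to allow for the first model load.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this cheat sheet


def build_generate_request(model, prompt):
    """Build the same POST the curl command sends to /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the request and return the generated text (needs Ollama running)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        # With "stream": false the reply is one JSON object with a "response" field.
        return json.loads(resp.read())["response"]


req = build_generate_request("qwen3:8b", "Why is the sky blue?")
print(req.get_method(), req.full_url)
```

With Ollama running and the model pulled, `generate("qwen3:8b", "Why is the sky blue?")` returns the answer as a string.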

Drive Claude Code on a local model

```bash
ollama launch claude          # easiest: built-in launcher

# ...or point Claude Code at Ollama manually:
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
claude --model qwen3.5
```

Claude Code needs a large context window — 64k+ — so use a roomy model (qwen3.5, glm-4.7-flash, or a :cloud tag).

AI in the browser (WebAI)

Run models in a tab via WebGPU — no install, no server, nothing leaves the page:

Transformers.js (Hugging Face) — any HF model in-browser: Whisper transcription, embeddings / semantic search, background removal. Set device: 'webgpu'.
MediaPipe LLM Inference (Google) — on-device Gemma chat via a high-level task API (now maintenance-mode; successor is LiteRT-LM).
ONNX Runtime Web (Microsoft) — production in-browser inference; also the engine under Transformers.js.
Bonus: WebLLM (full chat LLMs, OpenAI-compatible) and Chrome's built-in Prompt API / Gemini Nano.

Fine-tune your own (teaser)

LoRA / QLoRA — train a small adapter, not the whole model; a small model fine-tunes on one consumer GPU in under an hour.
Beginner-friendly tools: Unsloth (fastest), Axolotl (YAML config), MLX-LM (Apple Silicon); Hugging Face TRL for more control.
Then run it in Ollama: merge the adapter → export to GGUF → write a one-line Modelfile (FROM model.gguf, or ADAPTER ./adapter) → ollama create my-model. Use the same base you trained against.
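
The two Modelfile variants mentioned above could be generated like this. `render_modelfile` is a hypothetical helper, and `qwen3:8b` only stands in for whichever base model you actually trained against.

```python
def render_modelfile(base, adapter=None):
    """Return Modelfile text: a FROM line, plus an optional ADAPTER line."""
    lines = [f"FROM {base}"]
    if adapter is not None:
        lines.append(f"ADAPTER {adapter}")
    return "\n".join(lines) + "\n"


# Merged-and-exported route: the GGUF file already contains the fine-tune.
merged = render_modelfile("./model.gguf")

# Adapter route: keep the base model and layer the LoRA adapter on top.
# "qwen3:8b" is a placeholder for the base you trained against.
with_adapter = render_modelfile("qwen3:8b", adapter="./adapter")

print(merged, end="")
print(with_adapter, end="")
```

Save either string as a file named Modelfile, then run ollama create my-model -f Modelfile.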

Top gotchas + fixes

Out of memory / crawling slow → the model is too big for your RAM. Drop to a smaller size (e.g. from an :8b tag to a smaller one) or to a more heavily quantized tag (e.g. a q4 variant).
Quality feels off vs ChatGPT → that's quantization + a smaller model, not a bug. Step up a size if RAM allows, or use hybrid for the hard asks.
OpenWork can't see the model → make sure ollama serve is running and the model is pulled; baseURL must be http://localhost:11434/v1 (note the /v1).
First answer is slow → the model loads from disk into RAM on the first call. Subsequent calls are faster for as long as it stays loaded.
Hybrid key handling → cloud keys live in OpenWork's provider settings, not in your shared link. Don't paste keys into skills you share.
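
The "OpenWork can't see the model" fix comes down to two checks, which can be scripted. A minimal sketch, assuming the native /api/tags endpoint returns a JSON object whose "models" list carries each pulled tag under "name". Both function names are hypothetical.

```python
from urllib.parse import urlparse


def check_base_url(base_url):
    """Return a list of problems with an OpenAI-compatible Ollama base URL."""
    problems = []
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https"):
        problems.append("URL should start with http:// or https://")
    if parsed.path.rstrip("/") != "/v1":
        problems.append("path should be /v1")
    return problems


def model_is_pulled(tags_response, model):
    """Check a parsed /api/tags response for an exact model tag."""
    return any(m.get("name") == model for m in tags_response.get("models", []))


print(check_base_url("http://localhost:11434"))  # ['path should be /v1']
print(model_is_pulled({"models": [{"name": "qwen3:8b"}]}, "qwen3:8b"))  # True
```

To run the second check against a live server, fetch http://localhost:11434/api/tags, parse the JSON, and pass the result in. A connection error at that step usually means ollama serve is not running.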

Glossary

Open-weight model — the trained weights are published, so anyone can download and run it. Free to run; you own the copy.
Quantization (Q4, etc.) — shrinking the weights to fit in less RAM. Smaller = faster + lower RAM, slightly lower quality.
Ollama — a one-click runner that downloads open-weight models and serves them on localhost:11434.
OpenWork — an open-source desktop agent that drives a model (local or cloud) over your own files, browser, and tools.
Skill — a reusable instruction/setup in OpenWork you can package and share as a single link.
MCP server — a standard way to give an agent extra tools (files, search, APIs) it can call.

Template library

Copy a starting point, paste it into your assistant's instructions, then make it yours.

Point a fully-local agent at a folder and let it find, summarize, and organize.

Local file librarian (OpenWork agent)

You are my local file assistant. You work entirely on my machine — nothing leaves it.

Scope: only read and act on files inside the folder I've given you access to.
Voice: plain and direct. No filler.
What you do: when I ask, find the right files, summarize them, pull out the facts I need, and draft notes or replies grounded in what's actually in the folder.
Rules: never invent file contents or facts. If something isn't in the folder, say so. Quote the filename you used.
Ask first: before renaming, moving, or overwriting anything, show me the plan and wait for a yes.

When I give you a task, start by listing which files you'll look at.

A reusable OpenWork skill you can package and hand to a teammate as one link.

Shareable skill: weekly digest

Skill: Weekly Digest.

Purpose: read everything added to this folder in the last 7 days and write a one-page digest.
Output: a short summary up top (3–5 bullets), then a per-file list — filename, one line on what changed or what it says, and any action it implies.
Rules: work only from the files present. Flag anything you couldn't open. Don't include files older than 7 days.
Privacy: this skill is meant to run on a local model. It contains no API keys and no private data — only instructions.

Run this whenever I say "weekly digest" and tell me which folder you used.