Owning Your Stack: Local LLMs Plus Frontier Models, On Your Terms
A privacy-first architecture for running sensitive AI workloads locally on Apple Silicon while selectively routing to frontier models, with a redaction pipeline, cost breakdown, and honest assessment of where local inference still falls short.
A few years back I was building a client portal for Cigna, and the PM kept pasting draft requirement docs into ChatGPT to get summaries. I didn't say anything at the time. The docs had internal product names, user flow diagrams, a few references to claims data structures. None of it was technically classified, but none of it was meant to leave the building either. That moment stuck with me more than I expected. By the time I started seriously designing nClaw in 2025, it had become the architectural premise.
Every time you paste a file path, a function signature, or a design doc into a cloud chat interface, that content travels to a server you don't control, gets processed by infrastructure governed by a terms-of-service agreement you probably skimmed once, and may be retained in ways that depend entirely on the provider's current data policy. For most prompts that's fine. For the ones that matter (internal architecture decisions, unreleased product specs, code that touches authentication or billing, anything with a secret or a customer reference in it) that's a meaningful risk most developers have simply normalized.
I've been building nClaw, a personal AI workstation that runs as a plugin on nSelf, for about a year. The original motivation was mundane: I wanted to recall conversations from six months ago without hunting through browser tabs. But the deeper I got into the architecture, the more I realized the retrieval problem and the privacy problem are the same problem. If your memory lives in someone else's cloud, you don't actually own it.
This post covers the threat model, the architecture that came out of thinking through it carefully, and where local inference is still genuinely outclassed by frontier models. I want to be direct about all three.
Cloud LLM providers are, structurally, a data exfiltration channel. That's a sharp way to put it, but it's the accurate framing. When you send a prompt to any cloud model (Claude, GPT-5.4, Gemini, whatever) you're transmitting data outside your network boundary. The provider receives it. Whether it's used for training depends on your plan tier and their current policy. Whether it's retained depends on the same. Whether a future acquisition, policy change, or breach exposes it is outside your control entirely. Most enterprise agreements offer contractual protections. Most individual and small-team subscriptions do not.
The categories of data that create real exposure are more specific than people tend to think. Generic code questions are low risk. Stack traces that include internal hostnames, secrets in environment variables you forgot to scrub, architecture discussions that describe how your auth system works, internal roadmap documents you pasted for summarization: those carry actual risk. The threat isn't necessarily malicious; it's structural. You handed data to a party outside your boundary, and you've accepted whatever fate that party assigns to it.
The sovereignty argument isn't just about security. It's about audit. When I run a model locally, every query, every context document, every response, and every latency figure goes into a Postgres table I control. I can query it, audit it, grep it, and correlate it with anything else in my system. When a cloud model processes a request, I get a response. The intermediate state (what was retrieved, what was weighted, what was nearly said but wasn't) is invisible to me.
For nClaw, I designed the memory layer around this directly. Postgres is the only authoritative store: conversations, facts, entities, project state, routing metadata, and audit trails all go there. The vector store runs inside the same Postgres instance via pgvector. Full-text search uses generated tsvector columns with GIN indexes. The audit log is just another table. Nothing about the memory architecture requires trusting a third party.
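To make that concrete, here is roughly what the single-store layout looks like as DDL. This is a simplified sketch, not nClaw's actual schema; the table names, embedding dimension, and index choices are illustrative, and it assumes psycopg plus the pgvector extension.

```python
# A simplified sketch of the single-Postgres memory layout: one table that
# carries both the pgvector embedding and a generated tsvector column for
# full-text search, plus an audit table. Requires psycopg
# ("pip install psycopg[binary]") and the pgvector extension on the server.
# Table names, the embedding dimension, and index choices are illustrative.
import psycopg

DDL = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS memory_chunks (
        id         bigserial PRIMARY KEY,
        content    text NOT NULL,
        embedding  vector(1024),
        ts         tsvector GENERATED ALWAYS AS
                     (to_tsvector('english', content)) STORED,
        created_at timestamptz NOT NULL DEFAULT now()
    )
    """,
    "CREATE INDEX IF NOT EXISTS memory_chunks_ts_gin ON memory_chunks USING GIN (ts)",
    """
    CREATE INDEX IF NOT EXISTS memory_chunks_embedding_hnsw
        ON memory_chunks USING hnsw (embedding vector_cosine_ops)
    """,
    """
    CREATE TABLE IF NOT EXISTS audit_log (
        id         bigserial PRIMARY KEY,
        request_id uuid NOT NULL,
        route      text NOT NULL,        -- 'local' or 'frontier'
        redacted   boolean NOT NULL,
        latency_ms integer,
        created_at timestamptz NOT NULL DEFAULT now()
    )
    """,
]

with psycopg.connect("dbname=nclaw") as conn:
    for statement in DDL:
        conn.execute(statement)
```

Hybrid retrieval is then a single SQL query combining vector similarity and ts_rank over the same rows; there's no second system to keep in sync.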
The practical answer to the privacy problem isn't to abandon frontier models entirely. That would be a quality regression I'm not willing to accept for reasoning-heavy tasks. The answer is a routing layer that makes deliberate decisions about what goes where.
The hybrid pattern I've settled on has two inference planes. The local plane runs on Apple Silicon via MLX-LM, which has official support for running and quantizing models on Apple hardware. The frontier plane routes to Opus 4.7 or GPT-5.4 for tasks that genuinely need frontier-class reasoning, but only after a redaction pass that strips secrets, PII, and internal identifiers from the context before it leaves the machine.
The key design principle is that the routing decision happens before inference, not after. You're not running the prompt locally and then re-running it on a frontier model if the local result is unsatisfactory. That would double your exposure and your cost. You classify the prompt, decide where it goes, redact if necessary, and route once. The orchestrator in nClaw does this in a pre_route hook that runs deterministically: intent classification, privacy risk score, context size estimate, and model selection all happen in under 200 milliseconds on local hardware.
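To show the shape of that decision, here's a stripped-down sketch of a deterministic pre-route step. The thresholds, intent labels, and model names are placeholders rather than nClaw's actual hook, but the structure is the same: score, classify, select, route once.

```python
# Illustrative pre-route decision: score privacy risk, look at the intent,
# estimate nothing expensive, and pick a destination -- all locally, before
# any inference happens. Thresholds and model names are placeholders.
import re
from dataclasses import dataclass

SECRET_HINTS = re.compile(
    r"(api[_-]?key|password|BEGIN [A-Z ]*PRIVATE KEY|postgres://)", re.I
)

@dataclass
class RouteDecision:
    destination: str      # "local" or "frontier"
    model: str
    needs_redaction: bool
    risk_score: float

def pre_route(prompt: str, intent: str, context_tokens: int) -> RouteDecision:
    # Crude risk signal here; the real pipeline uses NER plus local dictionaries.
    risk = 1.0 if SECRET_HINTS.search(prompt) else 0.2

    # Anything risky stays local, unconditionally.
    if risk >= 0.8:
        return RouteDecision("local", "qwen3-14b", needs_redaction=False, risk_score=risk)

    # Cheap, well-understood tasks stay local too.
    if intent in {"classification", "summarization", "repo_completion", "rag_qa"}:
        model = "qwen3-coder-30b-a3b" if intent == "repo_completion" else "qwen3-14b"
        return RouteDecision("local", model, needs_redaction=False, risk_score=risk)

    # Reasoning-heavy or greenfield work goes frontier, but only after redaction.
    return RouteDecision("frontier", "claude-opus", needs_redaction=True, risk_score=risk)
```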
The local model stack I use for nClaw is built around the Qwen3 family. For general assistant work and moderate reasoning, Qwen3-14B in FP8 runs comfortably on Apple Silicon with the thinking toggle enabled: 14.8B parameters, 32K native context, 131K with YaRN extension. For heavier local reasoning, Qwen3-32B-AWQ (4-bit) handles tasks where the 14B doesn't have enough headroom, with the same thinking toggle and a 131K YaRN context ceiling. For agentic coding specifically, Qwen3-Coder-30B-A3B is the right model: 30.5B total parameters with only 3.3B activated per token (it's a mixture-of-experts architecture), 256K native context, and an explicit design orientation toward tool use and repo-scale code tasks.
The quantization tradeoffs matter here and they're worth being specific about. FP8 is the most accurate post-training quantization format that still fits comfortably on Apple Silicon unified memory. The 14B FP8 model uses roughly 15 GB of weight RAM. Q4 (4-bit) cuts that to around 9-11 GB for the same model but introduces quality degradation that shows up most clearly on structured reasoning and long-context coherence. Q8 is a middle ground that preserves most of FP8 quality at a moderate memory savings. For the coding model at 30B, 4-bit class deployments land around 18-22 GB, which still fits in a 48 GB Apple Silicon configuration alongside embeddings and a reranker.
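The arithmetic behind those figures is worth having at your fingertips. A back-of-the-envelope estimate, ignoring activations, KV cache, and per-format overhead, which is why real footprints land a few GB higher:

```python
# Back-of-the-envelope weight memory: parameter count x bytes per parameter.
# Real deployments add overhead (embeddings kept at higher precision,
# quantization scales, runtime buffers), so actual footprints sit above
# these floors.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "q8": 1.0, "q4": 0.5}

def weight_gb(params_billion: float, fmt: str) -> float:
    return params_billion * BYTES_PER_PARAM[fmt]   # 1B params at 1 byte ~= 1 GB

print(weight_gb(14.8, "fp8"))   # ~14.8 GB  -> "roughly 15 GB" for Qwen3-14B FP8
print(weight_gb(14.8, "q4"))    # ~7.4 GB floor -> ~9-11 GB in practice
print(weight_gb(30.5, "q4"))    # ~15.3 GB floor -> ~18-22 GB in practice
```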
MLX-LM's official Apple Silicon support isn't a hack or a port. It's first-class. The framework runs directly on the GPU through Metal, uses the unified memory architecture efficiently, and supports model serving with streaming. Running a 30B coding model on a Mac Studio or Mac Pro is a supported production workflow, not an experiment.
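The Python surface is small, too. A minimal sketch with the mlx-lm package; the model identifier is an example from the mlx-community collection, so substitute whichever quantization you actually run:

```python
# Minimal local generation with mlx-lm on Apple Silicon ("pip install mlx-lm").
# The model name is an example repo from the mlx-community collection on
# Hugging Face -- swap in the quantization you actually use.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-14B-8bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the tradeoffs of FP8 vs 4-bit quantization."}],
    add_generation_prompt=True,
    tokenize=False,
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(text)
```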
When the local workstation is offline or under heavy load, I route to vLLM or SGLang on a dedicated GPU node. vLLM's PagedAttention handles the KV cache efficiently enough that it makes sense for longer-context generation tasks whose prefill would be slow on Apple Silicon's more modest compute. SGLang adds structured output and speculative decoding on top of similar efficiency gains. Both expose OpenAI-compatible serving APIs, which means the routing layer doesn't need to know which backend it's talking to.
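In practice that means the fallback path is just a different base URL on the same client. A sketch, with placeholder host names and model identifiers:

```python
# The router talks to every backend through the same OpenAI-compatible client;
# only the base_url changes. Host names and model IDs are placeholders.
from openai import OpenAI

BACKENDS = {
    # A local OpenAI-compatible server (mlx_lm.server speaks this protocol).
    "local": OpenAI(base_url="http://localhost:8080/v1", api_key="unused"),
    # A vLLM or SGLang node; both serve the same /v1 endpoints.
    "gpu_node": OpenAI(base_url="http://gpu-node:8000/v1", api_key="unused"),
}

def complete(backend: str, model: str, prompt: str) -> str:
    resp = BACKENDS[backend].chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return resp.choices[0].message.content

# Same call, different inference plane:
# complete("local", "qwen3-14b", "...")
# complete("gpu_node", "Qwen/Qwen3-32B-AWQ", "...")
```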
The routing decision matrix is where the architecture actually comes together. The key insight is that most prompts don't need frontier-class reasoning. Classification, intent detection, code completion within a known codebase, summarization of documents you've already indexed locally, retrieval-augmented question answering over your own notes: all of these are tasks where a well-quantized local 14B or 30B model performs well enough that the quality difference versus GPT-5.4 or Opus 4.7 is undetectable in practice.
Here's the matrix I've implemented in nClaw's orchestrator. Classification tasks go local unconditionally. They're cheap, fast, and don't benefit from frontier reasoning. Code completion within a repo context you've already indexed goes local, using the Qwen3-Coder model with the 256K window to hold the relevant files. Summarization of documents you own goes local. Anything touching credentials, internal hostnames, or customer data goes local and stays local.
Deep architectural reasoning routes to a frontier model with a redacted context. I'm talking about the kind where you're genuinely uncertain and the stakes of a wrong answer are high: new library evaluation, security review of an authentication design, cross-system impact analysis for a schema migration. These benefit from frontier-scale training and I'm willing to pay for them, but I'm not willing to send the actual schema or the actual service names. The redaction pipeline runs first.
Code generation for greenfield features routes to frontier with redaction. The quality gap between a 30B local model and Claude Sonnet on complex multi-file generation is real and meaningful. I haven't trained a local model on my codebase and I don't plan to for now. So I accept the quality tradeoff, but I control what context gets sent.
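Written down as data rather than prose, the matrix is small enough to audit at a glance. Here's an illustrative version; in nClaw the rules live as rows in the model registry table rather than as a literal in code:

```python
# The routing matrix as inspectable data: task category -> (destination, redact).
# Category and rule names are illustrative.
ROUTING_MATRIX = {
    "classification":           ("local",    False),
    "repo_code_completion":     ("local",    False),
    "owned_doc_summarization":  ("local",    False),
    "sensitive_context":        ("local",    False),  # credentials, hostnames, customer data: never leaves
    "architectural_reasoning":  ("frontier", True),
    "greenfield_codegen":       ("frontier", True),
}

def route(task: str) -> tuple[str, bool]:
    # Unknown task types default to the safe side: local, no exposure.
    return ROUTING_MATRIX.get(task, ("local", False))
```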
The data redaction pipeline is the linchpin. Before any prompt crosses the boundary to a frontier provider, it passes through a local preprocessing step that does three things. First, secret detection: regex patterns for common secret formats (API keys, tokens, connection strings, PEM-encoded material) get replaced with placeholder strings. Second, PII scrubbing: named entity recognition, run locally, identifies person names, email addresses, phone numbers, and internal hostnames and replaces them with type-labeled tokens. Third, internal identifier normalization: service names, database names, and internal URLs get mapped to generic equivalents from a local dictionary.
The result is a prompt that retains the reasoning structure (the architecture question, the code pattern, the logic flow) while stripping the specifics that carry actual risk. A frontier model can still answer the structural question accurately. It just can't accidentally expose the internals it was never supposed to see.
I keep the redaction dictionary and the entity recognition models local and version-controlled. Every redaction event gets logged to the audit table with the original hash, the redacted form, and a timestamp. If something slips through, I can see it.
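Here's a stripped-down sketch of the three passes plus the audit record. The patterns and dictionary entries are illustrative; a real deployment needs a much broader secret ruleset and a local NER model for the PII pass rather than the regex stand-ins shown here.

```python
# Three redaction passes before anything crosses the boundary:
# 1) secret patterns, 2) PII (NER in the real pipeline; regex stand-ins here),
# 3) internal-identifier normalization from a local dictionary.
# The audit record carries a hash of the original, never the raw original.
import hashlib
import re

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key id
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]+?-----END [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"postgres(?:ql)?://\S+"),                 # connection strings
]
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b[\w-]+\.internal\.example\.com\b"), "<INTERNAL_HOST>"),  # illustrative host pattern
]
INTERNAL_DICTIONARY = {          # illustrative entries
    "billing-svc": "service-A",
    "claims_db": "database-1",
}

def redact(prompt: str) -> tuple[str, dict]:
    events = []
    redacted = prompt
    for pat in SECRET_PATTERNS:
        redacted, n = pat.subn("<REDACTED_SECRET>", redacted)
        if n:
            events.append({"kind": "secret", "count": n})
    for pat, placeholder in PII_PATTERNS:
        redacted, n = pat.subn(placeholder, redacted)
        if n:
            events.append({"kind": "pii", "count": n})
    for internal, generic in INTERNAL_DICTIONARY.items():
        count = redacted.count(internal)
        if count:
            redacted = redacted.replace(internal, generic)
            events.append({"kind": "identifier", "count": count})
    # Audit row: hash of the original plus the redacted form -- never the raw text.
    audit = {
        "original_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "redacted_prompt": redacted,
        "events": events,
    }
    return redacted, audit
```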
The cost analysis comes down to CapEx versus OpEx and where you set the breakeven. Frontier API calls at Claude Sonnet pricing run roughly $3 per million input tokens and $15 per million output tokens as of early 2025. A productive developer day touching 2-3 million tokens of reasoning work costs $10-40 in API fees. That sounds cheap until you're running an agentic system like nClaw or ClawDE that issues hundreds of sub-requests per session. At that scale, you're looking at $100-300 per day for heavy agentic work, or $2,000-6,000 per month.
A Mac Studio with an M4 Ultra at 512 GB unified memory runs around $10,000 upfront. Amortized over three years, that's roughly $280 per month in hardware cost. Power draw for sustained inference on Apple Silicon is considerably lower than a comparable GPU setup; Apple's efficiency architecture is genuinely better here. The crossover point where local inference becomes cheaper than API calls lands somewhere between two and four months of heavy agentic use. After that, the marginal cost of a local inference is electricity.
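The breakeven arithmetic, spelled out with the figures above (electricity omitted; including it pushes the crossover out a little, but not by much on Apple Silicon):

```python
# Breakeven between upfront hardware and monthly frontier API spend,
# using the numbers from the text.
HARDWARE_COST = 10_000                    # Mac Studio, 512 GB unified memory
AMORTIZATION_MONTHS = 36
API_HEAVY_PER_MONTH = (2_000, 6_000)      # heavy agentic use, all-frontier

print(HARDWARE_COST / AMORTIZATION_MONTHS)        # ~278 USD/month amortized

for monthly in API_HEAVY_PER_MONTH:
    print(f"${monthly}/mo all-frontier -> breakeven in {HARDWARE_COST / monthly:.1f} months")
# $2,000/mo -> 5.0 months; $6,000/mo -> 1.7 months: roughly "two to four" for typical mixes
```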
The caveat is that not all of your inference budget moves local. Tasks that genuinely need frontier reasoning stay on frontier, so the API bill doesn't go to zero. A realistic hybrid budget for heavy agentic development is probably $200-400 per month in API fees (down from $2,000-6,000) plus the amortized hardware cost. That's a meaningful reduction, and the privacy properties are categorically different.
The vector store economics are simple: Postgres with pgvector on your own hardware has no per-query cost. At the scale of a personal knowledge base (hundreds of thousands of embeddings) hosted vector database pricing is manageable but nonzero. More importantly, the queries are local-latency instead of network-latency, which matters for the interactive retrieval loop in something like ClawDE.
One wrinkle worth mentioning: KV cache economics. Qwen3-Coder at 256K context sounds like it solves the long-context problem, but 256K is mostly a KV cache problem, not a weights problem. In BF16, the KV cache for the 30B coder model runs approximately 94 MB per 1,000 tokens. At 256K tokens that's around 24 GB of KV cache alone, before the 18-22 GB of weights. On a 48 GB Apple Silicon machine you're tight. On 96 GB or 128 GB you have real room. This is why I treat long native context as a scarce resource and let Postgres hybrid retrieval do the work of selecting what actually goes into the window. You want 8-16K of high-signal retrieved context, not 200K of everything.
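For anyone who wants to check that arithmetic, the estimate assumes the published Qwen3-Coder-30B-A3B attention config: 48 layers, 4 KV heads under grouped-query attention, and 128-dimensional heads, with BF16 cache entries.

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value.
# Config values assumed from the published Qwen3-Coder-30B-A3B model config.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 48, 4, 128, 2     # BF16 = 2 bytes

per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # 98,304 bytes per token
per_1k_tokens_mib = per_token * 1_000 / 2**20         # ~93.8 MiB -> "~94 MB per 1,000 tokens"
full_256k_gib = per_token * 262_144 / 2**30           # ~24 GiB at a full 256K window

print(per_1k_tokens_mib, full_256k_gib)
```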
I want to be honest about where local models still don't match frontier quality. The gap is real and it would be misleading to paper over it.
Cross-domain synthesis is the clearest gap. A task like "review this pull request from a security, performance, and correctness standpoint and identify the highest-risk change" is asking a model to hold multiple expert mental models simultaneously and reason about their interactions. Qwen3-32B does this adequately. Claude Sonnet does it noticeably better, and Claude Opus does it better still. The difference isn't dramatic on any single dimension. It's the coherence of the synthesis that separates them.
Novel problem reasoning is the second gap. When you're working in genuinely unfamiliar territory (a protocol you haven't implemented before, a security model you're designing from scratch, an architecture tradeoff in a domain where you don't have strong priors) the frontier models have seen more training data covering more edge cases. They make better initial suggestions and catch more categories of mistake. A 30B local model is good at what it knows. The frontier models have a wider spread.
Long-horizon instruction following is the third gap. Agentic tasks that require holding a complex goal across twenty or thirty steps, recovering gracefully from partial failures, and adjusting the plan based on intermediate results: these are harder for local models. The Qwen3-Coder 256K context window helps significantly for coding agents, but the model's ability to track implicit state across a long task loop still trails frontier models in my experience.
None of these gaps are reasons to route everything to frontier. They're reasons to route the right things to frontier, with a redaction pipeline, while running everything else locally. The routing layer is the product. The models on both sides are just the current instantiation of it.
There's also a fourth gap that gets less attention: instruction following under constraint. When a frontier model is told to output JSON matching a schema, it does. When a local 14B model is told the same, it mostly does, with occasional drift that requires retry logic. Structured output is improving fast across all model families and vLLM's constrained generation handles this well with grammar-based sampling, but it's worth knowing that your agentic pipeline needs more defensive parsing when the local model is on the other end.
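Defensive parsing mostly means validating against a schema and retrying a bounded number of times. A sketch; the schema, retry budget, and call_model hook are placeholders:

```python
# Defensive structured-output handling for local models: validate against a
# schema, retry a bounded number of times, fail loudly rather than silently.
# Requires "pip install jsonschema"; call_model and the schema are placeholders.
import json
from jsonschema import ValidationError, validate

RESULT_SCHEMA = {
    "type": "object",
    "required": ["intent", "confidence"],
    "properties": {
        "intent": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
}

def parse_with_retries(call_model, prompt: str, max_attempts: int = 3) -> dict:
    last_error = None
    for attempt in range(max_attempts):
        nudge = "" if attempt == 0 else (
            f"\n\nReturn ONLY valid JSON matching the schema. Previous error: {last_error}"
        )
        raw = call_model(prompt + nudge)
        try:
            # Strip a stray code fence before parsing -- local models drift.
            cleaned = raw.strip().removeprefix("```json").removesuffix("```")
            parsed = json.loads(cleaned)
            validate(parsed, RESULT_SCHEMA)
            return parsed
        except (json.JSONDecodeError, ValidationError) as exc:
            last_error = str(exc)
    raise RuntimeError(f"structured output failed after {max_attempts} attempts: {last_error}")
```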
The architecture I've described here is what nClaw implements, and it's what ClawDE uses as the inference backend for agentic coding sessions. The vector store is Postgres. The routing rules are stored as rows in a model registry table, versioned and inspectable. The audit log captures every request, its classification, where it was routed, and whether it was redacted. The local models run on Apple Silicon via MLX-LM. The frontier calls go out with sanitized context.
The thing that changed for me when I started building this way is that AI feels like a tool rather than a service. A tool you own, that runs on your hardware, that logs to your database, and that you can inspect, debug, and modify. The service model is more convenient for getting started. The tool model is better for everything that actually matters.
Your memory is yours. The architecture should match that.