The Multi-AI Router: Stop Sending Every Question to GPT-5.4 or Opus 4.7
Every request to a 70B frontier model that could've been handled by a 7B local model is money you didn't need to spend and latency you didn't need to add. The Multi-AI Router pattern fixes that by treating model selection as a first-class architectural concern.
For the first year I built with LLMs, every request went to the same model. It was the obvious path. One model, one API, one bill. Opus 4.7 for the user question about their project notes, Opus 4.7 for extracting a phone number from an email, Opus 4.7 for checking whether a code snippet compiles. It worked. It was also wasteful in a way I didn't fully appreciate until I started building nClaw.
nClaw is my personal AI workstation. It handles research threads, email ingestion, document retrieval, and agentic coding tasks, all running through a single OpenAI-compatible gateway. When I started tracking token spend per request type, the distribution wasn't what I expected. About 60% of requests were simple: classification, extraction, short-form generation. Another 25% were moderate: multi-turn reasoning, summarization, retrieval-augmented answers. The remaining 15% genuinely required a strong model. I was paying frontier prices for the whole stack.
The fix is what I now call a Multi-AI Router, or MAR. The core idea is simple: classify the incoming request before you pick a model, then route to the cheapest model that can handle the task correctly. This post covers how that works in practice, why the naive implementation fails, and what the evaluation harness looks like.
Any single-model serving architecture lives inside a cost-quality-latency triangle. You can have two of three. A small local model is cheap and fast but trades off quality on hard tasks. A large frontier model is high quality but expensive and slow. The typical workaround is to pick a model in the middle and accept mediocrity everywhere. The MAR approach reframes the problem: you don't need the same quality for every request. You need the right quality per request, delivered by the model best suited to that specific task shape.
Request classification is the first and most important step. Before any model sees the user's message, the MAR pipeline assigns it an intent label. The labels I use in nClaw map to roughly six categories: classify (determine a category, extract a label), extract (pull structured data from text), generate (write something new given a prompt), reason (multi-step problem solving, logical inference), search (decide what to retrieve, rank results), and code (write or analyze code). These aren't mutually exclusive. A coding question that requires multi-step planning gets tagged as both reason and code, which pushes it higher on the tier ladder.
The classification step itself should be cheap. In nClaw I run a small local classifier, currently a 4B instruct model, as the router. It emits a routing token, not a full response. The token might be classify, extract, generate:medium, reason:hard, or code:agentic. The classifier isn't making a quality judgment about the request. It's answering the narrower question: what kind of work is this? A 4B model can do that reliably at low cost and under 200 milliseconds.
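Here's a minimal sketch of what that routing call can look like, assuming an OpenAI-compatible gateway on localhost; the endpoint URL, model name, and exact label set are illustrative rather than nClaw's actual values.

```python
# Minimal routing-classifier sketch. Assumes an OpenAI-compatible gateway on localhost;
# the model name and label set are illustrative, not nClaw's exact values.
from openai import OpenAI

ROUTING_LABELS = {
    "classify", "extract", "generate:short", "generate:medium",
    "reason", "reason:hard", "search", "code", "code:agentic",
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def route(message: str, thread_summary: str = "") -> str:
    """Ask a small local model for a routing token, not a full answer."""
    prompt = (
        "You are a request router. Reply with exactly one label from this list:\n"
        + ", ".join(sorted(ROUTING_LABELS))
        + f"\n\nThread summary: {thread_summary or '(none)'}\nNew message: {message}"
    )
    resp = client.chat.completions.create(
        model="qwen3-4b-instruct",   # the small local classifier
        messages=[{"role": "user", "content": prompt}],
        max_tokens=8,
        temperature=0.0,             # deterministic labels
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in ROUTING_LABELS else "reason"  # unknown label -> escalate
```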
The model registry is a Postgres table. Each row represents one model definition and declares its capabilities, cost profile, and latency characteristics. The schema I landed on in nClaw has fields for the model slug, role (router, general, reasoner, coder, verifier, embedder, reranker, summarizer), provider, API model name, context window size, parameter count, quantization format, whether it supports tool calls, whether it supports a thinking mode, and whether it's currently active. This is the single source of truth for what the router knows about its options.
Having the registry in Postgres rather than in application config means you can update it without a deploy. You can add a new local model, disable a provider that's having an outage, or adjust cost weights at runtime. The orchestrator reads the active model list on startup and caches it with a short TTL. When a routing rule references a model slug that no longer has an active row, the fallback chain fires automatically.
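A sketch of what the registry table and a TTL-cached loader might look like. The column names approximate the fields described above, and the psycopg-based loader is illustrative; nClaw's exact schema and access layer may differ.

```python
# Registry sketch: one table, one TTL-cached loader. Column names approximate the fields
# described above; the exact nClaw schema may differ.
import time
import psycopg

REGISTRY_DDL = """
CREATE TABLE IF NOT EXISTS model_registry (
    slug              text PRIMARY KEY,
    role              text NOT NULL,   -- router | general | reasoner | coder | verifier | embedder | reranker | summarizer
    provider          text NOT NULL,   -- local | openai-compatible | anthropic | ...
    api_model_name    text NOT NULL,
    context_window    integer NOT NULL,
    param_count_b     numeric,
    quantization      text,            -- fp8 | awq | ...
    supports_tools    boolean NOT NULL DEFAULT false,
    supports_thinking boolean NOT NULL DEFAULT false,
    active            boolean NOT NULL DEFAULT true
);
"""

CACHE_TTL_S = 60  # short TTL: registry edits show up without a deploy
_cache: tuple[float, list[dict]] | None = None

def active_models(conninfo: str) -> list[dict]:
    """Return active registry rows, cached for CACHE_TTL_S seconds."""
    global _cache
    now = time.monotonic()
    if _cache is not None and now - _cache[0] < CACHE_TTL_S:
        return _cache[1]
    cols = ["slug", "role", "provider", "api_model_name",
            "context_window", "supports_tools", "supports_thinking"]
    with psycopg.connect(conninfo) as conn:
        rows = conn.execute(
            f"SELECT {', '.join(cols)} FROM model_registry WHERE active"
        ).fetchall()
    models = [dict(zip(cols, row)) for row in rows]
    _cache = (now, models)
    return models
```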
The tier ladder in nClaw has five levels. Tiny models handle pure classification: labeling intent, extracting structured fields, assigning a routing token. These run locally and are effectively free. The general tier handles the majority of user requests: multi-turn conversation, summarization, retrieval-augmented generation with moderate context. A 14B model handles most of these well. The reasoner tier handles tasks tagged as reason:hard, logic puzzles, and multi-document synthesis where the general model demonstrably underperforms. The code tier handles code-heavy turns, repository-scale reasoning, and agentic tool-use loops. Finally, the reranker isn't a generation model but a cross-encoder: it takes retrieval candidates and scores relevance against the query before the context window is assembled.
An optional verifier sits at the end of the pipeline for high-stakes outputs. It receives the generator's response and checks it against a rubric. In nClaw I use the verifier primarily for code outputs and factual claims that will be written to the memory store. The verifier doesn't regenerate the answer. It returns a structured verdict: correct, uncertain, or incorrect with a brief reason. If uncertain or incorrect, the orchestrator re-routes to the next tier up.
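The verdict contract can be as small as a status and a one-line reason. A sketch, with illustrative field names:

```python
# Verdict sketch: the verifier returns a status and a short reason; anything short of
# "correct" sends the request to the next tier up. Field names are illustrative.
from dataclasses import dataclass
from typing import Literal

@dataclass
class Verdict:
    status: Literal["correct", "uncertain", "incorrect"]
    reason: str  # one-line justification from the verifier model

def needs_escalation(verdict: Verdict) -> bool:
    return verdict.status != "correct"
```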
The practical model stack I run locally in nClaw reflects the Qwen3 family's current strengths. Qwen3-14B in FP8 handles the general tier. At roughly 15GB of weight memory and a 131K context window with YaRN, it covers most conversational and retrieval tasks comfortably. Qwen3-32B in AWQ handles the reasoner tier. Qwen3-Coder-30B handles code turns, and it earns its slot: 256K native context and explicit optimization for agentic coding workflows mean it actually handles multi-file diffs without the context-stuffing gymnastics you need with general models.
Fallback chains deserve more attention than most routing writeups give them. There are two primary patterns, each optimizing for a different failure mode. A frontier-first chain defaults to an external API and falls back to a local model when the API is unavailable, rate-limited, or errors out. Use this when you trust frontier quality and treat local inference as graceful degradation. A local-first chain defaults to a local model and escalates to a frontier API when the local model's confidence drops below a threshold. Use this when cost is the primary concern and you're willing to accept an occasional round-trip penalty for hard requests.
nClaw uses a hybrid approach. Simple and medium requests go local-first. The orchestrator samples a confidence estimate from the classifier and escalates to frontier if confidence is below 0.75. Hard and agentic requests go frontier-first, with a local fallback for availability. The threshold numbers aren't magic. They came from running a golden task evaluation over a few hundred labeled examples from my actual usage.
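Put together, the hybrid policy is a few lines of logic. In this sketch the confidence value is assumed to come from the classifier (for example, the routing label's logprob), and the model slugs are illustrative:

```python
# Hybrid chain selection sketch. Confidence is assumed to come from the classifier
# (e.g. the routing label's logprob); 0.75 is the calibrated threshold mentioned above,
# and the model slugs are illustrative.
CONFIDENCE_THRESHOLD = 0.75
FRONTIER_FIRST_LABELS = {"reason:hard", "code:agentic"}

def pick_chain(label: str, confidence: float) -> list[str]:
    """Return an ordered list of model slugs to try."""
    if label in FRONTIER_FIRST_LABELS:
        return ["frontier-api", "qwen3-32b-awq"]    # frontier-first, local fallback for availability
    if confidence < CONFIDENCE_THRESHOLD:
        return ["frontier-api", "qwen3-14b-fp8"]    # low-confidence classification: escalate up front
    return ["qwen3-14b-fp8", "frontier-api"]        # local-first, frontier as backup
```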
The routing rules themselves live in a second Postgres table. Each rule has a priority, an enabled flag, a JSON condition, a target model slug, and a fallback slug. The condition might be something like intent contains code AND context_tokens > 4000. The orchestrator evaluates rules in priority order and takes the first match. You can disable a rule without deleting it, which makes A/B testing easy to manage.
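A sketch of first-match evaluation against a table like that. The flat condition format here is an assumption; nClaw's JSON conditions may be richer:

```python
# First-match rule evaluation sketch. The flat condition format is an assumption;
# nClaw's JSON conditions may be richer.
import psycopg

def matches(condition: dict, ctx: dict) -> bool:
    """ctx example: {"intent": ["reason", "code"], "context_tokens": 5200}."""
    if "intent_contains" in condition and condition["intent_contains"] not in ctx["intent"]:
        return False
    if "min_context_tokens" in condition and ctx["context_tokens"] <= condition["min_context_tokens"]:
        return False
    return True

def select_model(conninfo: str, ctx: dict) -> tuple[str, str] | None:
    """Return (target_slug, fallback_slug) for the first enabled rule that matches."""
    with psycopg.connect(conninfo) as conn:
        rules = conn.execute(
            "SELECT condition, target_slug, fallback_slug FROM routing_rules "
            "WHERE enabled ORDER BY priority"
        ).fetchall()
    for condition, target, fallback in rules:   # priority order, first match wins
        if matches(condition, ctx):
            return target, fallback
    return None  # no match: caller falls back to the default general model
```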
Qwen3's thinking mode deserves a separate note because it gives you something close to a two-tier proxy inside a single model. By default, Qwen3 operates in non-thinking mode: fast, direct, roughly equivalent to a well-tuned 14B or 32B model depending on which variant you run. Flip the thinking toggle and the same model runs internal chain-of-thought before producing its final answer, similar to a reasoning model. The quality jump on logic-heavy tasks is measurable. The latency cost is also measurable.
In practice this means you can route thinking-mode off for general requests and thinking-mode on for tasks tagged as reason:hard, without switching models or changing your inference endpoint. The model registry row for a Qwen3 model includes a supports_thinking boolean, and the routing rule can include a thinking: true flag in its condition payload. The gateway translates that into the appropriate API parameter before sending the request to the inference backend.
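A sketch of the translation step, shown for a vLLM-style backend where Qwen3's thinking mode is toggled through chat_template_kwargs; other backends expose the toggle differently, so treat the exact parameter name as an assumption:

```python
# Thinking-mode translation sketch, for a vLLM-style backend where Qwen3's toggle is
# passed via chat_template_kwargs. Other backends expose this differently, so the exact
# parameter name is an assumption.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

def complete(model: str, messages: list[dict], thinking: bool):
    return client.chat.completions.create(
        model=model,
        messages=messages,
        extra_body={"chat_template_kwargs": {"enable_thinking": thinking}},
    )
```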
That single-model two-tier proxy pattern also matters for resource planning. Running two separate model processes, one general and one reasoner, requires two sets of GPU memory or two separate inference servers. Routing thinking mode inside one Qwen3-32B instance lets you defer a second server until your load actually justifies it. On my current setup, one Hetzner CAX11 handles the general tier and thinking mode handles the occasional hard request. The heavier Qwen3-Coder model runs on a separate Apple Silicon node and is only invoked when the code tier fires.
The economic argument is cleaner when you put rough numbers on it. A frontier API call to a 70B-class model typically costs around $0.002 per 1K output tokens at current rates. A 14B model running locally on hardware you own amortizes to roughly $0.0003 per 1K tokens once you account for server cost. That roughly 7x difference means if 60% of your requests are genuinely classifiable as simple-to-moderate, routing them local-first cuts your inference cost by roughly 50% compared to always hitting the frontier API. The numbers shift as model pricing changes, but the structure doesn't: specialization at the routing layer preserves budget for the requests that actually need it.
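The back-of-envelope math, using those illustrative rates:

```python
# Back-of-envelope blended cost, using the illustrative rates above.
frontier = 0.002    # $ per 1K output tokens, 70B-class API
local    = 0.0003   # $ per 1K output tokens, 14B amortized on owned hardware

blended = 0.6 * local + 0.4 * frontier   # 60% of traffic stays local
print(blended / frontier)                # ~0.49: roughly half the always-frontier cost
```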
Latency tells a similar story. A 14B local model over a fast local network returns a first token in under 500 milliseconds in most configurations. A frontier API call incurs network round-trip plus queue time, which in practice means 1-3 seconds before the first token. For interactive use cases this difference is noticeable. Routing simple requests local-first feels faster to a user at a keyboard, even if aggregate throughput is similar.
Now the question that makes or breaks the whole approach: how do you know the router is making correct decisions? The naive answer is vibes. You run it for a week and see if quality degrades. That's not an evaluation harness.
The evaluation harness in nClaw has two main components. The first is a golden task set: labeled examples drawn from my actual usage. Each example has an input, a routing label (which model should have been used), and a quality reference (the expected output or a rubric for judging it). The golden set started with about 200 examples and grows incrementally as I add annotations. When I change a routing rule or update the classifier, I run the golden set and measure routing accuracy: what percentage of examples got routed to the expected tier.
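A sketch of a golden-set entry and the routing-accuracy check, with illustrative field names:

```python
# Golden-set sketch: one labeled example and the routing-accuracy check. Field names
# are illustrative.
from dataclasses import dataclass

@dataclass
class GoldenTask:
    input_text: str
    expected_tier: str   # which tier should have handled this request
    rubric: str          # reference output or a rubric for judging quality

def routing_accuracy(tasks: list[GoldenTask], route_fn) -> float:
    """route_fn maps input text to a tier label (e.g. the classifier from earlier)."""
    hits = sum(1 for t in tasks if route_fn(t.input_text) == t.expected_tier)
    return hits / len(tasks)
```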
The second component is counterfactual sampling. For a random sample of live requests, the orchestrator routes the request as normal but also sends it to the next tier up. Both responses are logged. A background job compares them: are the outputs semantically equivalent? If yes, the cheaper routing was correct. If no, that request should have escalated. Over time this builds a picture of where the routing thresholds are miscalibrated, without manual annotation of every request.
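A sketch of the sampling hook; the 5% sample rate and the judge-based equivalence check are stand-ins for whatever comparison you trust:

```python
# Counterfactual sampling sketch. The 5% sample rate and the judge-based equivalence
# check are stand-ins for whatever comparison you trust.
import random

SAMPLE_RATE = 0.05

def handle(request, chain: list[str], generate, judge_equivalent, log):
    """chain[0] is the chosen tier; chain[1], if present, is the next tier up."""
    answer = generate(chain[0], request)
    if len(chain) > 1 and random.random() < SAMPLE_RATE:
        shadow = generate(chain[1], request)           # same request, one tier up
        ok = judge_equivalent(answer, shadow)          # True -> cheap routing was enough
        log(request, chain[0], chain[1], ok)
    return answer
```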
A/B testing at the routing layer means having two routing rule sets active simultaneously, split by user session or request hash. Half of requests go through the current ruleset. Half go through a candidate ruleset where you've adjusted a threshold or added a rule. After enough samples you compare quality scores and cost-per-request between the two groups. It's the same evaluation discipline you'd apply to any production system change, but routing decisions are cheap enough to make it practical even at low traffic volumes.
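Deterministic splitting is the only subtle part: hash the session so a user never flips between rulesets mid-conversation. A sketch:

```python
# Deterministic A/B split by session hash, so a session never flips rulesets mid-conversation.
import hashlib

def ruleset_for(session_id: str) -> str:
    digest = hashlib.sha256(session_id.encode()).digest()
    return "candidate" if digest[0] % 2 == 0 else "current"
```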
There are three anti-patterns I've either built and regretted or seen described confidently in architectural writeups, and none of them survive contact with real usage. Over-aggressive routing is the first. If you classify 80% of requests as simple and route them to a 4B model, your quality floor drops quickly. Push the routing threshold hard enough to maximize savings and you accept visible quality degradation on a non-trivial fraction of requests. The right threshold is project-specific and should be calibrated from data, not assumed.
Classifier hallucination is the second anti-pattern. A small routing model can confabulate intent labels on ambiguous inputs. A message that looks like a simple extract request might actually require multi-step reasoning because of context in the thread history the classifier didn't see. If the classifier only looks at the current message without thread summary context, its error rate on context-dependent inputs is meaningfully higher. In nClaw the classifier receives a brief thread summary alongside the new message, which cuts ambiguous misroutes substantially.
Capability declaration drift is the third, and it's the one that bites you slowly. Your model registry declares that model X supports tool calls. You deploy a new version of the inference backend. The new version has a different tool-call format. Your routing rules still say model X is available for code:agentic, tool calls fire, and the model returns malformed responses. The registry is out of date and everything downstream fails quietly. The fix is to include capability smoke tests in your deployment pipeline: before marking a model as active in the registry, run a minimal capability check against the actual inference endpoint.
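A sketch of one such smoke test, probing tool-call support through the OpenAI-compatible endpoint before the registry row is flipped to active; the trivial echo tool is a stand-in for a real probe:

```python
# Capability smoke test sketch: probe tool-call support through the OpenAI-compatible
# endpoint before flipping the registry row to active. The echo tool is a trivial probe.
from openai import OpenAI

def tool_calls_work(base_url: str, model: str) -> bool:
    client = OpenAI(base_url=base_url, api_key="local")
    tools = [{
        "type": "function",
        "function": {
            "name": "echo",
            "description": "Echo the given text back.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    }]
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Call the echo tool with the text 'ping'."}],
            tools=tools,
            tool_choice="auto",
        )
    except Exception:
        return False   # backend rejected the tool schema outright
    calls = resp.choices[0].message.tool_calls
    return bool(calls) and calls[0].function.name == "echo"
```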
One lesson I'd give my earlier self: build the evaluation harness before you tune the routing rules, not after. The temptation is to get routing working first and add measurement later. In practice, routing without measurement means you have no idea whether your changes are improving or degrading quality. The golden task set is small to start with (50 examples is enough to catch gross misroutes), and it compounds in value as you add more examples from real failures.
The MAR architecture in nClaw isn't finished. The classifier needs better handling of multi-intent requests. The fallback chain doesn't yet track degradation patterns, so if a particular class of requests consistently fails at the general tier and escalates to the reasoner, there's no automatic rule promotion. The evaluation harness needs a UI. These are incremental improvements, not blockers. That's the point of getting the kernel stable first.
What's working: the cost curve is significantly flatter than single-model serving. The quality distribution is more consistent because hard requests reliably get the stronger model instead of occasionally getting dropped into a shared pool. The model registry in Postgres means I can experiment with new models without code changes. And the Qwen3 thinking mode toggle gives me a lightweight way to handle the moderate-to-hard boundary without running a second server.
If you're building something similar, the most important decision is where you put the routing logic. Keep it in the orchestrator layer, not in the client. If each client decides which model to call, you can't change routing rules without deploying every client. The orchestrator is the right place: one routing surface, inspectable in Postgres, testable against a golden set. Everything downstream just receives a completion.