Build an Apple-Style Multi-Model AI Router in Python

On June 8, 2026, Apple did something it had resisted for a decade. At WWDC, it shipped an Apple Intelligence Extensions system that lets you choose which model answers you on your iPhone: Google Gemini (the default), OpenAI's ChatGPT, or Anthropic's Claude. Each one even keeps its own voice, so you know who is talking. A single product now routes your requests to whichever frontier model you trust most.

That idea is the interesting part for builders. "Pick your model" is not an Apple feature, it is an architecture. If your own app is hard-wired to one provider, you inherit that provider's outages, price changes, and weak spots. The fix is a thin layer that speaks one language to your code and translates to many models underneath. Apple just validated the pattern in front of two billion devices.

This guide builds that layer in Python from scratch, with no heavyweight framework. You will wrap Claude, Gemini, and GPT behind one interface, route each request to the best model for the job, and fall back automatically when a provider returns a 529. The routing core is fully runnable today, and the example output below is real.

What you'll build

A unified reply type so Claude, Gemini, and GPT all return the same shape to your code.
Three provider adapters using each vendor's current 2026 SDK call, verified against the official docs.
A ModelRouter that picks a model per task (code, fast, reasoning, chat) and fails over to the next provider on error.
A worked example: one HTTP-style handler that serves three different task types through the same router.

Prerequisites

Python 3.10+ (the code uses dataclasses and built-in generics such as list[str], which need 3.9 or later).
API keys for the providers you want to enable. You can run the router with zero keys using the mock provider shown at the end.
The three official SDKs: pip install anthropic google-genai openai.
Basic familiarity with environment variables for keys (ANTHROPIC_API_KEY, GEMINI_API_KEY, OPENAI_API_KEY).

Step 1 - The one idea that makes routing work

Every provider has a different request and response shape. Anthropic returns message.content[0].text. The Google Gen AI SDK gives you response.text. OpenAI's Responses API exposes response.output_text. If that leaks into your app code, you have not built a router, you have built three apps. So the first move is to define a normalized type that every adapter must produce.

from dataclasses import dataclass, field

@dataclass
class Reply:
    text: str
    model: str
    provider: str
    latency_ms: int
    usage: dict = field(default_factory=dict)

class ProviderError(Exception):
    """Raised by an adapter when a call fails and should trigger fallback."""

Now define the contract. Every backend is a Provider with one method, complete(), that takes a prompt and returns a Reply. This is the whole abstraction. Your application will only ever see Reply objects, never a raw SDK response.

class Provider:
    name = "base"
    def __init__(self, model: str):
        self.model = model
    def complete(self, prompt: str, system: str = "",
                 max_tokens: int = 1024) -> Reply:
        raise NotImplementedError

Step 2 - Write the three provider adapters

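Each adapter does three things: call the vendor SDK, catch failures and re-raise them as ProviderError, and pack the result into a Reply. For brevity the adapters below catch every Exception, which is broader than you want in production; pitfall 2 below explains how to narrow it to retryable errors. Here is Claude, using the Anthropic Messages API. The text lives in message.content[0].text.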

import time
from anthropic import Anthropic

class ClaudeProvider(Provider):
    name = "claude"
    def __init__(self, model="claude-opus-4-6"):
        super().__init__(model)
        self.client = Anthropic()  # reads ANTHROPIC_API_KEY

    def complete(self, prompt, system="", max_tokens=1024):
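        t0 = time.time()
        # Send `system` only when one was given: the SDK expects a
        # string or an omitted argument here, not None.
        extra = {"system": system} if system else {}
        try:
            msg = self.client.messages.create(
                model=self.model,
                max_tokens=max_tokens,
                messages=[{"role": "user", "content": prompt}],
                **extra,
            )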
        except Exception as e:
            raise ProviderError(f"claude: {e}") from e
        return Reply(
            text=msg.content[0].text,
            model=self.model, provider=self.name,
            latency_ms=int((time.time() - t0) * 1000),
            usage={"input_tokens": msg.usage.input_tokens,
                   "output_tokens": msg.usage.output_tokens},
        )
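
One hedge on msg.content[0].text: content is a list of typed blocks, and the first block is not guaranteed to be a text block (tool-use or thinking blocks can appear when those features are enabled). A minimal sketch of a more defensive accessor follows. It assumes only that each block exposes a type attribute and that text blocks carry .text; join_text_blocks is a hypothetical helper name, not part of the SDK, and the demo blocks are stand-ins for msg.content.

```python
from types import SimpleNamespace

def join_text_blocks(blocks) -> str:
    """Concatenate the text of every block whose type is "text".

    Hypothetical helper: works on any objects exposing `.type` and
    `.text`, and ignores every other block type.
    """
    return "".join(b.text for b in blocks if getattr(b, "type", None) == "text")

# Stand-in blocks for illustration; a real call would pass msg.content.
demo = [
    SimpleNamespace(type="thinking", thinking="..."),
    SimpleNamespace(type="text", text="Hello, "),
    SimpleNamespace(type="text", text="world."),
]
print(join_text_blocks(demo))
```

Inside ClaudeProvider.complete, this would replace the msg.content[0].text expression if you expect non-text blocks in responses.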

Next, Gemini, using Google's google-genai SDK. Note the different ergonomics: the system instruction goes in a config object, and the answer is simply response.text.

from google import genai
from google.genai import types

class GeminiProvider(Provider):
    name = "gemini"
    def __init__(self, model="gemini-3.5-flash"):
        super().__init__(model)
        self.client = genai.Client()  # reads GEMINI_API_KEY

    def complete(self, prompt, system="", max_tokens=1024):
        t0 = time.time()
        try:
            resp = self.client.models.generate_content(
                model=self.model,
                contents=prompt,
                config=types.GenerateContentConfig(
                    system_instruction=system or None,
                    max_output_tokens=max_tokens,
                ),
            )
        except Exception as e:
            raise ProviderError(f"gemini: {e}") from e
        um = resp.usage_metadata
        return Reply(
            text=resp.text,
            model=self.model, provider=self.name,
            latency_ms=int((time.time() - t0) * 1000),
            usage={"input_tokens": um.prompt_token_count,
                   "output_tokens": um.candidates_token_count},
        )

Finally GPT, using OpenAI's Responses API. The system prompt is passed as instructions, and the convenience accessor response.output_text flattens the output array into a string for you.

from openai import OpenAI

class GPTProvider(Provider):
    name = "gpt"
    def __init__(self, model="gpt-5.5"):
        super().__init__(model)
        self.client = OpenAI()  # reads OPENAI_API_KEY

    def complete(self, prompt, system="", max_tokens=1024):
        t0 = time.time()
        try:
            resp = self.client.responses.create(
                model=self.model,
                instructions=system or None,
                input=[{"role": "user", "content": prompt}],
                max_output_tokens=max_tokens,
            )
        except Exception as e:
            raise ProviderError(f"gpt: {e}") from e
        u = resp.usage
        return Reply(
            text=resp.output_text,
            model=self.model, provider=self.name,
            latency_ms=int((time.time() - t0) * 1000),
            usage={"input_tokens": u.input_tokens,
                   "output_tokens": u.output_tokens},
        )

Three vendors, three SDK styles, one return type. Your app no longer cares which one ran.

Step 3 - The router: task routing plus fallback

The router holds a dictionary of providers and a policy: a function that, given a task label, returns an ordered list of provider keys to try. It walks that list, skips any key with no registered provider, returns the first success, and on a ProviderError moves to the next candidate. If every option fails, it raises a ProviderError that carries the last failure.

from typing import Callable

class ModelRouter:
    def __init__(self, providers: dict[str, Provider],
                 policy: Callable[[str], list[str]]):
        self.providers = providers
        self.policy = policy

    def ask(self, prompt: str, task: str = "chat",
            system: str = "") -> Reply:
        order = self.policy(task)
        last = None
        for key in order:
            prov = self.providers.get(key)
            if not prov:
                continue
            try:
                return prov.complete(prompt, system=system)
            except ProviderError as e:
                last = e
                print(f"  ! {key} failed ({e}); falling back")
        raise ProviderError(f"all providers failed for {task}: {last}")

The policy is where your product judgment lives. Route coding prompts to your strongest coder, high-volume cheap traffic to the fastest model, and heavy reasoning to whichever model leads on your evals. Order matters: position one is the primary, the rest are fallbacks.

def policy(task: str) -> list[str]:
    table = {
        "code":      ["claude", "gpt", "gemini"],
        "fast":      ["gemini", "gpt", "claude"],
        "reasoning": ["gpt", "claude", "gemini"],
        "chat":      ["gemini", "claude", "gpt"],
    }
    return table.get(task, table["chat"])
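
Because ModelRouter.ask() skips keys that have no registered provider, a typo in the policy table quietly shortens a fallback chain instead of raising. A small startup check can catch that. This is a sketch under stated assumptions: missing_policy_keys is a hypothetical helper, and demo_policy is a stand-in with a deliberate typo, not the policy above.

```python
from typing import Callable

def missing_policy_keys(policy: Callable[[str], list[str]],
                        tasks: list[str],
                        registered: set[str]) -> dict[str, list[str]]:
    """Return {task: [keys the policy names that are not registered]}."""
    problems: dict[str, list[str]] = {}
    for task in tasks:
        unknown = [k for k in policy(task) if k not in registered]
        if unknown:
            problems[task] = unknown
    return problems

# Stand-in policy with a deliberate typo ("gemeni") for illustration.
def demo_policy(task: str) -> list[str]:
    table = {"code": ["claude", "gpt"], "fast": ["gemeni", "gpt"]}
    return table.get(task, table["code"])

report = missing_policy_keys(demo_policy, ["code", "fast"],
                             {"claude", "gpt", "gemini"})
print(report)  # {'fast': ['gemeni']}
```

Running a check like this once at startup, with the keys of your providers dictionary as registered, turns a silent misroute into an explicit report.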

Step 4 - Run it, with real fallback output

To prove the wiring works without spending a token, swap in a MockProvider. We make the claude mock fail on purpose so you can watch the router fail over to gpt for a coding task. With Reply, ProviderError, Provider, ModelRouter, and policy from the earlier steps in the same file, this script runs as-is.

import time

class MockProvider(Provider):
    def __init__(self, name, model, reply, fail=False, delay=0.02):
        super().__init__(model)
        self.name, self._reply = name, reply
        self._fail, self._delay = fail, delay
    def complete(self, prompt, system="", max_tokens=1024):
        t0 = time.time()
        time.sleep(self._delay)
        if self._fail:
            raise ProviderError(f"{self.name} 529 overloaded")
        return Reply(self._reply, self.model, self.name,
                     int((time.time()-t0)*1000),
                     {"input_tokens": len(prompt.split()),
                      "output_tokens": len(self._reply.split())})

providers = {
    "claude": MockProvider("claude", "claude-opus-4-6",
                           "def is_prime(n): ...", fail=True),
    "gpt":    MockProvider("gpt", "gpt-5.5",
              "def is_prime(n):\n    return n>1 and all("
              "n%i for i in range(2,int(n**.5)+1))"),
    "gemini": MockProvider("gemini", "gemini-3.5-flash",
                           "Here's a quick summary."),
}
router = ModelRouter(providers, policy)

r = router.ask("Write an is_prime function", task="code")
print(f"-> {r.provider}/{r.model} in {r.latency_ms}ms")
print(r.text)
print("usage:", r.usage)

Output (verbatim from one run; the latency figure varies between runs):

  ! claude failed (claude 529 overloaded); falling back
-> gpt/gpt-5.5 in 29ms
def is_prime(n):
    return n>1 and all(n%i for i in range(2,int(n**.5)+1))
usage: {'input_tokens': 4, 'output_tokens': 10}

The coding policy tried Claude first, the mocked outage raised a ProviderError, and the router logged the failure, moved on to GPT, and returned a clean Reply. Delete the mocks, register the three real adapters under the same keys, and the same fallback path runs against live APIs.
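
When you register real adapters, one option is to enable only the providers whose API key is actually set, so a missing key never becomes a runtime failure. The sketch below assumes the environment variable names from the prerequisites; enabled_providers is a hypothetical helper, and the lambda factories are stand-ins for the adapter classes so the example runs without any SDK installed.

```python
import os
from typing import Callable, Mapping

def enabled_providers(
    factories: Mapping[str, tuple[str, Callable[[], object]]],
    env: Mapping[str, str] = os.environ,
) -> dict[str, object]:
    """Instantiate only the providers whose API-key variable is set."""
    return {
        key: make()
        for key, (env_var, make) in factories.items()
        if env.get(env_var)  # unset or empty means "skip this provider"
    }

# Stand-in factories; a real app would pass ClaudeProvider and friends.
factories = {
    "claude": ("ANTHROPIC_API_KEY", lambda: "claude-adapter"),
    "gemini": ("GEMINI_API_KEY", lambda: "gemini-adapter"),
    "gpt": ("OPENAI_API_KEY", lambda: "gpt-adapter"),
}
fake_env = {"GEMINI_API_KEY": "k1", "OPENAI_API_KEY": "k2"}
print(enabled_providers(factories, fake_env))
```

The resulting dictionary can be passed straight to ModelRouter, which already skips policy keys that have no registered provider.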

Worked example: one endpoint, three task types

Here is the payoff. A single request handler reads the task label from the incoming request and hands it to the router. The application code is one router.ask() call and mentions no vendor at all. Switching your primary coder from Claude to GPT later is a one-line edit in policy(), with no change to this handler.

def handle(request: dict) -> dict:
    reply = router.ask(
        prompt=request["prompt"],
        task=request.get("task", "chat"),
        system=request.get("system", ""),
    )
    return {
        "answer": reply.text,
        "served_by": f"{reply.provider}:{reply.model}",
        "latency_ms": reply.latency_ms,
        "usage": reply.usage,
    }

# three different jobs, one code path
print(handle({"prompt": "Refactor this loop", "task": "code"})["served_by"])
print(handle({"prompt": "TL;DR this thread", "task": "fast"})["served_by"])
print(handle({"prompt": "Plan a migration",  "task": "reasoning"})["served_by"])

With the mocks above (Claude forced down), this prints the fallback notice for Claude, then gpt:gpt-5.5, then gemini:gemini-3.5-flash, then gpt:gpt-5.5 for the reasoning task, whose policy starts at GPT. Your users get the right model per job and do not see a single-provider outage, because the router absorbs it.
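
The handler takes the task label from the request and defaults to "chat". When callers cannot supply a label, a keyword heuristic is one crude way to pick one. Everything in this sketch is an illustrative assumption: classify is a hypothetical helper, and the keyword rules are untuned placeholders that a real product would replace with something evaluated on its own traffic.

```python
import re

# Illustrative keyword rules only; checked in order, first match wins.
RULES = [
    ("code", re.compile(r"\b(refactor|function|bug|stack trace|compile)\b", re.I)),
    ("reasoning", re.compile(r"\b(plan|prove|trade-?offs?|step by step)\b", re.I)),
    ("fast", re.compile(r"\b(tl;dr|summar(y|ize|ise)|translate)\b", re.I)),
]

def classify(prompt: str, default: str = "chat") -> str:
    """Return the first task label whose pattern matches, else the default."""
    for label, pattern in RULES:
        if pattern.search(prompt):
            return label
    return default

for p in ["Refactor this loop", "TL;DR this thread",
          "Plan a migration", "Hello there"]:
    print(f"{p!r} -> {classify(p)}")
```

In the handler, this would slot in as the fallback for a missing label, for example task=request.get("task") or classify(request["prompt"]).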

Common pitfalls and how to avoid them

1. Letting raw SDK objects escape the adapter. The moment a Gemini response or an Anthropic message reaches your business logic, the abstraction is broken and a future provider swap becomes a refactor. Rule: adapters return Reply, nothing else. Treat any import anthropic outside an adapter file as a code smell.

2. Falling back on errors that will never succeed. A 529 (overloaded) or a timeout is worth retrying on another provider. A 400 from a malformed request or a 401 from a missing key is not: the first is likely to fail everywhere, the second is a configuration bug that fallback would only hide, and both burn latency and money. Inspect the error and only raise ProviderError for retryable cases; let permanent errors propagate immediately.

3. Pretending the models are interchangeable. Token limits, system-prompt handling, and tool-call formats differ. Gemini wants the system instruction in a config; OpenAI calls it instructions; Anthropic takes a top-level system. The adapter is exactly where you normalize these differences, so keep that logic inside it rather than leaking conditionals into the router.

4. Hard-coding model names in twelve places. Keep model IDs in the adapter constructors (or a config file) so a version bump from claude-opus-4-6 to the next release is a single change. The same applies to gemini-3.5-flash and gpt-5.5.

5. No cost or latency visibility. Because every Reply carries provider, model, latency_ms, and usage, log them on every call. Without that you cannot answer "which model is actually serving traffic and what is it costing me," which is the entire reason you built a router.

6. Forgetting fallback changes your output distribution. When Claude is down and GPT answers instead, the response style and structured-output shape can shift. If downstream code parses JSON, validate it after the router returns, not inside one provider's adapter, so every path is checked the same way.
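
Pitfall 2 can be made concrete. The adapters in Step 2 wrap every Exception, so a 401 would trigger fallback just like a 529. The sketch below narrows that. It assumes the SDK error exposes a status_code attribute (the anthropic and openai status errors do; other SDKs may name it differently, so adapt status_of). The retryable status set and the helper names are assumptions for illustration, and SDK-specific connection or timeout exceptions would need to be added to the isinstance check.

```python
# Assumed set of "try someone else" statuses; tune for your providers.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504, 529}

class ProviderError(Exception):
    """Same role as the Step 1 class; repeated so this sketch runs alone."""

def status_of(exc: Exception):
    """Best-effort status lookup; assumes a `.status_code` attribute."""
    return getattr(exc, "status_code", None)

def is_retryable(exc: Exception) -> bool:
    # Built-in network errors count as retryable on another provider.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    return status_of(exc) in RETRYABLE_STATUS

def call_with_classification(name: str, fn):
    """Run fn(); wrap retryable failures, let permanent ones propagate."""
    try:
        return fn()
    except Exception as e:
        if is_retryable(e):
            raise ProviderError(f"{name}: {e}") from e
        raise

class FakeStatusError(Exception):
    """Stand-in for an SDK status error, for the demo only."""
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

def failing(status_code: int):
    def call():
        raise FakeStatusError(status_code)
    return call

for code in (529, 401):
    try:
        call_with_classification("claude", failing(code))
    except ProviderError as e:
        print(code, "-> fall back:", e)
    except FakeStatusError as e:
        print(code, "-> propagate:", e)
```

Inside an adapter, the body of the try block would become the fn passed to call_with_classification, so the router only ever sees ProviderError for failures that another provider might survive.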

Quick reference

Provider	SDK (2026)	Client + call	Text accessor	Example model ID
Anthropic Claude	anthropic	Anthropic().messages.create(...)	message.content[0].text	claude-opus-4-6
Google Gemini	google-genai	genai.Client().models.generate_content(...)	response.text	gemini-3.5-flash
OpenAI GPT	openai	OpenAI().responses.create(...)	response.output_text	gpt-5.5

Router design cheatsheet:

Piece	Job	Keep it where
Reply dataclass	One return shape for all models	Shared module
Provider.complete()	Call SDK, return Reply	One file per vendor
ProviderError	Signal a retryable failure	Raised only in adapters
policy(task)	Ordered provider preference	One function, easy to edit
ModelRouter.ask()	Try in order, fall back	Never edited per vendor

Next steps

Add streaming: give Provider a stream() method and yield normalized text chunks so the router works behind a chat UI.
Make fallback smarter: add a per-provider timeout and a circuit breaker so a flapping backend is skipped for 30 seconds instead of retried every call.
Route by cost: extend the policy to weigh usage from past calls and prefer the cheapest model that still passes your eval bar for that task.
Add tool use: normalize each vendor's function-calling format inside the adapter, exposing one tool schema to your app.
Wire real keys and load-test: register the live adapters, fire the three task types under load, and watch the logged provider/latency/usage to tune the policy.
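
The circuit breaker idea from the list above can be sketched in a few lines. This is one possible shape, not a definitive implementation: CircuitBreaker, its threshold, and its method names are assumptions, and the injectable clock exists only so the demo can advance time without sleeping.

```python
import time

class CircuitBreaker:
    """Skip a provider for `cooldown` seconds after `threshold`
    consecutive failures."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures: dict[str, int] = {}
        self.open_until: dict[str, float] = {}

    def allow(self, key: str) -> bool:
        """True when the provider is not in a cooldown window."""
        return self.clock() >= self.open_until.get(key, 0.0)

    def record_success(self, key: str) -> None:
        self.failures[key] = 0

    def record_failure(self, key: str) -> None:
        self.failures[key] = self.failures.get(key, 0) + 1
        if self.failures[key] >= self.threshold:
            # Open the breaker and start counting afresh afterwards.
            self.open_until[key] = self.clock() + self.cooldown
            self.failures[key] = 0

# Demo with a fake clock so no real waiting is needed.
now = [0.0]
cb = CircuitBreaker(threshold=2, cooldown=30.0, clock=lambda: now[0])
cb.record_failure("claude")
cb.record_failure("claude")   # second failure opens the breaker
print(cb.allow("claude"))     # False
now[0] = 31.0                 # cooldown has elapsed
print(cb.allow("claude"))     # True
```

To use it, ModelRouter.ask() would check allow(key) before calling a provider and record the outcome afterwards, so a flapping backend is skipped rather than retried on every request.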

Apple's "pick your AI" moment made one thing obvious: no single model wins every task, and depending on one provider is a liability. A 120-line router gives your own product the same resilience, the same freedom to swap models as the leaderboard churns, and the same clean separation Apple just shipped to a billion phones.