Skip to content
Build a Realtime Voice Assistant With gpt-realtime-2 — ContentBuffer guide

Build a Realtime Voice Assistant With gpt-realtime-2

K
Kodetra Technologies··10 min read Intermediate

Summary

Stream audio in and out, add tools, approvals, and handoffs with gpt-realtime-2 in Python.

Build a Realtime Voice Assistant With gpt-realtime-2

On June 3, 2026 OpenAI shipped a fresh batch of real-time audio models aimed squarely at conversational agents, and the developer chatter since has been about one thing: you no longer need to glue together a speech-to-text model, an LLM, and a text-to-speech model with three round trips of latency. A single model now listens, thinks, calls your tools, and talks back over one persistent connection.

The fastest way to build on it from Python is the OpenAI Agents SDK, which has first-class support for the Realtime API. You write a normal agent with instructions and tools, point it at gpt-realtime-2, and the SDK manages the WebSocket, audio streaming, turn-taking, interruptions, and tool execution for you. That is the difference between a weekend hack and a phone-quality voice assistant.

This guide walks through a working voice agent end to end: a server-side session, streaming audio, a real function tool, a handoff to a specialist agent, and an output guardrail. Every code snippet is checked against the current SDK docs. By the end you will have a support-line voice agent that looks up orders, transfers to billing, and refuses to read card numbers out loud.

What you will build

A server-side realtime voice agent that you can drive from a script or wire into a microphone. It runs on gpt-realtime-2, detects when the caller stops talking, streams spoken replies back as audio, calls a Python function to fetch an order status, hands the conversation to a billing specialist when needed, and trips a guardrail before it ever speaks a full credit-card number.

  • A RealtimeAgent with instructions and a function tool.
  • A RealtimeRunner configured with nested audio settings and semantic turn detection.
  • An event loop that handles audio, history_added, agent_end, and error.
  • A real microphone-to-speaker path using send_audio().
  • Tool approvals, a handoff to a billing agent, and an output guardrail.

Prerequisites

  • Python 3.10 or higher.
  • An OpenAI API key with access to the Realtime API.
  • Basic familiarity with async/await in Python.
  • For the microphone demo: a working mic and speakers, plus the sounddevice and numpy packages.

One boundary worth knowing up front: the Python SDK runs realtime sessions server-side over WebSocket. It does not ship a browser WebRTC transport. Use it for backend orchestration, tools, approvals, and telephony; use the JavaScript SDK if you need the audio to live in the browser.

Step 1 — Install the SDK and set your key

pip install openai-agents

# For the microphone demo later in this guide
pip install sounddevice numpy

Export your key so the SDK can authenticate the WebSocket automatically:

export OPENAI_API_KEY="sk-..."

If you would rather pass the key explicitly, you can hand it to the runner at start time with model_config={"api_key": "..."}. Just remember: if you supply your own headers, the SDK will not inject an Authorization header for you.

Step 2 — Define a RealtimeAgent

A RealtimeAgent is intentionally narrower than the regular Agent type. The model is chosen at the session level, not on the agent, and structured outputs are not supported. What does carry over: instructions, function tools, handoffs, hooks, and output guardrails. Keep the instructions short and conversational, because the model is speaking, not writing an essay.

from agents.realtime import RealtimeAgent

agent = RealtimeAgent(
    name="Support",
    instructions=(
        "You are a friendly phone support agent for an online store. "
        "Keep replies short and conversational, one or two sentences. "
        "Ask for an order number before looking anything up."
    ),
)

Step 3 — Configure the RealtimeRunner

The runner is a session factory. It wires your starting agent to the Realtime transport and holds the audio settings. For new agents, start with gpt-realtime-2 and the nested audio.input / audio.output shape. The older flat aliases (input_audio_format, turn_detection, and friends) still work, but the nested form is what the docs recommend going forward.

from agents.realtime import RealtimeRunner

runner = RealtimeRunner(
    starting_agent=agent,
    config={
        "model_settings": {
            "model_name": "gpt-realtime-2",
            "audio": {
                "input": {
                    "format": "pcm16",
                    "transcription": {"model": "gpt-4o-mini-transcribe"},
                    "turn_detection": {
                        "type": "semantic_vad",
                        "interrupt_response": True,
                    },
                },
                "output": {
                    "format": "pcm16",
                    "voice": "ash",
                },
            },
            "tool_choice": "auto",
        }
    },
)

Two settings do most of the heavy lifting. semantic_vad turn detection lets the model decide when you have actually finished a thought instead of cutting you off at the first pause, and interrupt_response: True means if the caller starts talking while the agent is speaking, the agent stops and listens. That barge-in behavior is what makes a voice agent feel human.

Step 4 — Open the session and handle events

Unlike a text run, runner.run() does not return a finished answer. It returns a live RealtimeSession. You enter it with async with, push input, and then iterate over events as they stream in. Here is a minimal text-driven loop you can run right now to confirm everything is wired up:

import asyncio
from agents.realtime import RealtimeAgent, RealtimeRunner

agent = RealtimeAgent(
    name="Support",
    instructions="You are a helpful voice assistant. Keep replies short.",
)

runner = RealtimeRunner(
    starting_agent=agent,
    config={
        "model_settings": {
            "model_name": "gpt-realtime-2",
            "audio": {
                "input": {"format": "pcm16"},
                "output": {"format": "pcm16", "voice": "ash"},
            },
        }
    },
)

async def main() -> None:
    session = await runner.run()
    async with session:
        await session.send_message("Say hello in one short sentence.")
        async for event in session:
            if event.type == "audio":
                # Raw PCM16 bytes you could pipe to a speaker.
                pass
            elif event.type == "history_added":
                print(event.item)
            elif event.type == "agent_end":
                break
            elif event.type == "error":
                print(f"Error: {event.error}")

if __name__ == "__main__":
    asyncio.run(main())

Run it with python agent.py. You will see the assistant turn print as a history item, while the spoken version arrives in the audio events as raw PCM16 bytes. Example console output:

$ python agent.py
RealtimeMessageItem(role='assistant',
  content=[{'type': 'output_audio',
            'transcript': 'Hi there! How can I help you today?'}])

send_message() takes a plain string or a structured message. The structured form is also how you send an image into a voice conversation, using a content list with an input_image entry alongside input_text.

Step 5 — Give the agent a tool

A voice agent that cannot do anything is just a chatbot with a microphone. Function tools work during live conversations exactly like they do in text agents: decorate a Python function with @function_tool and add it to the agent. The model will call it mid-sentence, get the result, and keep talking.

from agents import function_tool
from agents.realtime import RealtimeAgent

FAKE_ORDERS = {
    "A1001": "shipped, arriving Tuesday",
    "A1002": "processing in our warehouse",
}

@function_tool
def get_order_status(order_id: str) -> str:
    """Look up the status of an order by its ID."""
    return FAKE_ORDERS.get(order_id.upper(), "no order found with that ID")

agent = RealtimeAgent(
    name="Support",
    instructions=(
        "You are a phone support agent. Ask for an order number, "
        "then call get_order_status and read the result back plainly."
    ),
    tools=[get_order_status],
)

Now a caller who says “Where is order A1001?” triggers the tool, and the agent speaks back “Your order shipped and should arrive Tuesday.” You will also see tool_start and tool_end events flow through your loop, which are handy for showing a “looking that up…” indicator in a UI.

Step 6 — Stream real microphone audio

Text is fine for testing, but the point is voice. Use session.send_audio() to push raw PCM16 bytes from a microphone, and play the audio events back through your speakers. With semantic_vad turn detection enabled, the server decides when you are done speaking, so you simply keep streaming chunks.

import asyncio
import numpy as np
import sounddevice as sd
from agents.realtime import RealtimeAgent, RealtimeRunner

SAMPLE_RATE = 24000  # gpt-realtime expects 24 kHz pcm16

agent = RealtimeAgent(name="Support", instructions="Keep replies short.")
runner = RealtimeRunner(
    starting_agent=agent,
    config={"model_settings": {
        "model_name": "gpt-realtime-2",
        "audio": {
            "input": {"format": "pcm16",
                       "turn_detection": {"type": "semantic_vad",
                                          "interrupt_response": True}},
            "output": {"format": "pcm16", "voice": "ash"},
        },
    }},
)

async def mic_loop(session):
    loop = asyncio.get_running_loop()
    q: asyncio.Queue = asyncio.Queue()
    def on_audio(indata, frames, time, status):
        loop.call_soon_threadsafe(q.put_nowait, bytes(indata))
    with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=1,
                           dtype="int16", callback=on_audio):
        while True:
            await session.send_audio(await q.get())

async def main():
    session = await runner.run()
    async with session:
        asyncio.create_task(mic_loop(session))
        out = sd.RawOutputStream(samplerate=SAMPLE_RATE, channels=1,
                                 dtype="int16")
        out.start()
        async for event in session:
            if event.type == "audio":
                out.write(event.audio.data)
            elif event.type == "audio_interrupted":
                out.stop(); out.start()  # flush on barge-in
            elif event.type == "error":
                print("Error:", event.error)

asyncio.run(main())

Speak into the mic and the agent talks back. When you interrupt it mid-reply, the audio_interrupted event fires; flushing the output stream there stops stale audio so the conversation stays in sync with what the caller actually heard. If turn detection were disabled, you would instead mark the end of a turn yourself with await session.send_audio(chunk, commit=True).

Step 7 — Add approvals and a handoff

For anything sensitive, such as issuing a refund, you want a human in the loop. A function tool can require approval; when the model tries to call it, the session emits tool_approval_required and pauses until you approve or reject it.

async for event in session:
    if event.type == "tool_approval_required":
        # In production, gate this on a real human decision.
        await session.approve_tool_call(event.call_id)
        # or: await session.reject_tool_call(event.call_id)

Handoffs let one agent pass the live call to a specialist. Build the specialist as its own RealtimeAgent and attach it with realtime_handoff. The triage agent decides when to transfer; the voice and session carry over seamlessly.

from agents.realtime import RealtimeAgent, realtime_handoff

billing_agent = RealtimeAgent(
    name="Billing",
    instructions="You handle refunds and billing questions only.",
)

triage_agent = RealtimeAgent(
    name="Support",
    instructions=(
        "Help with orders. If the caller asks about a refund or charge, "
        "hand off to billing."
    ),
    tools=[get_order_status],
    handoffs=[realtime_handoff(billing_agent,
                              tool_description="Transfer to billing support")],
)

One catch: realtime handoffs do not support the regular handoff input_filter. The whole conversation goes across as-is.

Step 8 — Guard the output

Realtime agents support output guardrails only, and they behave differently from text guardrails. They run on debounced transcript chunks rather than every token, and instead of raising an exception they emit a guardrail_tripped event, cancel the in-flight response, and nudge the model to produce a safer reply.

import re
from agents.guardrail import GuardrailFunctionOutput, OutputGuardrail
from agents.realtime import RealtimeAgent

CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def no_card_numbers(context, agent, output):
    return GuardrailFunctionOutput(
        tripwire_triggered=bool(CARD.search(output)),
        output_info=None,
    )

agent = RealtimeAgent(
    name="Support",
    instructions="Never read full card numbers aloud.",
    output_guardrails=[OutputGuardrail(guardrail_function=no_card_numbers)],
)

Because guardrails run on debounced text, some audio may already be buffered when the tripwire fires. Always listen for audio_interrupted in your playback loop and stop local playback immediately, or the caller will hear the first half of the forbidden sentence.

Putting it together: a support line

Combine the pieces and you have a complete worked example. A caller dials in, the triage agent greets them and asks for an order number, calls get_order_status to read it back, and transfers to the billing agent the moment a refund comes up. A guardrail sits over both agents so neither one ever speaks a full card number, and the refund tool waits for human approval before it runs.

A real session transcript looks like this:

Caller:  Hi, where's my order A1001?
Support: Let me check. Your order shipped and should arrive Tuesday.
Caller:  Great. Actually I want a refund on it.
Support: Sure, let me bring in our billing team.
[handoff -> Billing]
Billing: I can help with that refund. Can you confirm the order number?
[tool_approval_required: issue_refund(order_id='A1001')]
... agent waits for your approve_tool_call() ...

Common pitfalls and gotchas

  • Forgetting the session context. runner.run() only returns a session object. Nothing connects until you enter async with session: (or call await session.enter()). Skip it and your event loop never fires.
  • Wrong sample rate. gpt-realtime works in 24 kHz PCM16. Capture at 44.1 kHz or 48 kHz without resampling and the agent hears chipmunk audio and mis-transcribes everything.
  • Not handling audio_interrupted. If you do not flush your speaker buffer on barge-in, the caller keeps hearing the old reply while the model has already moved on. This also matters when a guardrail trips mid-sentence.
  • Expecting input guardrails. Realtime agents support output guardrails only. There is no input guardrail hook; validate or gate user input yourself before triggering a response.
  • Setting the model on the agent. RealtimeAgent ignores per-agent model choice. The model lives in the runner's model_settings. Set it there or you will silently get the default.
  • Passing custom headers and losing auth. If you supply headers in model_config, the SDK stops adding Authorization for you. Include the auth header yourself, especially on Azure endpoints.
  • Changing voice mid-call. The output voice cannot change after the session has produced spoken audio. Pick it before the first reply.
  • Reaching for browser WebRTC in Python. The Python SDK is server-side WebSocket (and SIP for telephony) only. For in-browser audio, use the JavaScript SDK and keep Python on the backend.

Quick reference

PieceWhat it doesKey call / setting
RealtimeAgentInstructions, tools, handoffs, output guardrailsRealtimeAgent(name, instructions, tools=...)
RealtimeRunnerWires agent to transport, holds audio configRealtimeRunner(starting_agent, config=...)
Start sessionReturns a live session (not a result)session = await runner.run()
Open connectionConnects the WebSocketasync with session:
Send textPush a user message, start a responseawait session.send_message(text)
Send audioStream raw PCM16 mic bytesawait session.send_audio(chunk)
ModelRecommended realtime modelmodel_name='gpt-realtime-2'
Turn-takingServer decides end of turn + barge-inturn_detection: semantic_vad
Key eventsDrive your UI and playbackaudio, history_added, agent_end, error
HandoffTransfer call to a specialistrealtime_handoff(other_agent)
ApprovalPause a tool for a humanapprove_tool_call / reject_tool_call

Next steps

  • Wire the agent to a phone line with the SDK's SIP transport (OpenAIRealtimeSIPModel) and the Realtime Calls API.
  • Add a RealtimePlaybackTracker for accurate interruption truncation in telephony or any delayed-playback setup.
  • Give the agent MCP tools so it can hit your real systems instead of a fake order dictionary.
  • Add tracing to inspect tool calls, handoffs, and guardrail events for every conversation.
  • Browse the official examples/realtime directory for full mic, web, and Twilio demos.

Sources: OpenAI Agents SDK realtime quickstart and guide (openai.github.io/openai-agents-python), and OpenAI's June 3, 2026 real-time audio model announcement.

Comments

Subscribe to join the conversation...

Be the first to comment

Found this useful?

Get new AI guides for builders by email. Free.

Join 1,970 builders reading daily.