Building Voice Agents with Azure OpenAI Realtime API
Everything I learned building two production voice agent systems — a Memory Agent and a Visit Agent — and the Azure OpenAI Realtime WebRTC SDK that came out of it. Latency constraints, ephemeral token architecture, turn detection, transcript sync, post-call pipelines, and the mistakes I wish I hadn't made.
Voice agents are not chatbots with a mic taped on. They require a completely different mental model — one where latency is the product, where state machines replace request-response cycles, and where the wrong architectural decision at the ephemeral token layer will haunt you in production at 2 AM.
This is everything I learned building two voice agent systems — a Memory Agent and a Visit Agent — and shipping azure-realtime-webrtc, the TypeScript SDK that emerged from that work.
Why voice agents are architecturally different
The standard LLM integration looks like this: user sends text → you call an API → you wait → you return a response. Latency of a few seconds is annoying but tolerable. Users expect it.
Voice is the opposite. Humans start evaluating a conversation within 200–400ms of speaking. If your agent does not respond within that window, users assume the call dropped. Every architectural decision you make flows from this constraint.
The second difference is state. A text chat can be stateless — each message carries full context. Voice is a continuous audio stream. Your agent must track:
- Whether the user is currently speaking
- Whether it should be listening or processing
- Whether audio output is playing (and must be interrupted)
- Whether a tool call is in flight
Get any of these wrong and you get the most embarrassing failure mode in voice AI: the agent speaking over itself, or the user speaking while the agent is talking, with no one listening to either.
The Azure OpenAI Realtime architecture
Azure OpenAI's Realtime API runs on WebRTC (browser) or WebSocket (server), with a data channel carrying JSON events alongside the audio stream. The key insight that took me a while to internalize: the model is not stateless. It maintains conversation state server-side across the entire session. You're not sending messages — you're sending events that mutate shared state.
The session flow:
Browser                  Your Server              Azure OpenAI
│                             │                         │
│ POST /api/realtime/token    │                         │
│────────────────────────────>│ POST /client_secrets    │
│                             │────────────────────────>│
│                             │ { value: token }        │
│ { token }                   │<────────────────────────│
│<────────────────────────────│                         │
│                                                       │
│ WebRTC SDP offer + ephemeral token                    │
│──────────────────────────────────────────────────────>│
│ SDP answer                                            │
│<──────────────────────────────────────────────────────│
│                                                       │
│ ◄══ Bidirectional audio (WebRTC media track) ═══►     │
│ ◄══ JSON events (WebRTC data channel) ════════►       │
Your API key never reaches the browser. The ephemeral token is short-lived (minutes), scoped to a single session, and carries the session configuration — instructions, voice, tools, turn detection. This is a security model you have to build deliberately.
The ephemeral token is not just auth
This was the biggest conceptual shift. Most developers treat the ephemeral token as equivalent to an API key — a credential you swap for access. It is not. The token is the session configuration. Whatever you encode in the POST to /client_secrets becomes the session:
const response = await fetch(
  `https://${resource}.openai.azure.com/openai/v1/realtime/client_secrets`,
  {
    method: 'POST',
    headers: { 'api-key': apiKey, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      session: {
        type: 'realtime',
        model: deployment,
        instructions: systemPrompt,
        tools: toolDefinitions,
        tool_choice: 'auto',
        audio: {
          input: {
            transcription: { model: 'whisper-1', language: 'en' },
          },
          output: { voice: 'cedar' },
        },
      },
    }),
  }
);
// Fail loudly instead of destructuring an error payload
if (!response.ok) {
  throw new Error(`Token request failed with status ${response.status}`);
}
const { value: token, expires_at } = await response.json();
The implications of this:
- System prompt injection happens server-side, before the token is minted. You can inject the authenticated user's name, role, or context into the instructions at token generation time. The browser never sees the raw prompt.
- Tool definitions travel with the token. You choose which tools are available for this session at this moment, not globally. In the Memory Agent, the "capture" persona gets save_insight + get_session_context. The "assistant" persona gets retrieve_knowledge + get_session_context. Same model deployment, different tool sets, determined server-side.
- tool_choice: "required" is a trap. I tried it. The model would loop calling tools after receiving results and never actually speak. Stay on "auto" and enforce retrieval through the tool description and system prompt instead.
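As a rough sketch of the first two points, the session body can be assembled per user and per persona on the server before the token is minted. The tool names come from the Memory Agent described above; buildSessionBody, the AuthenticatedUser shape, the prompt wording, and the tool descriptions and parameter schemas are hypothetical placeholders, not the actual Memory Agent code.

```typescript
type Persona = 'capture' | 'assistant';

interface AuthenticatedUser {
  name: string;
  role: string;
}

interface ToolDefinition {
  type: 'function';
  name: string;
  description: string;
  parameters: Record<string, unknown>;
}

// Hypothetical helper: descriptions and schemas are placeholders.
function tool(name: string, description: string): ToolDefinition {
  return { type: 'function', name, description, parameters: { type: 'object', properties: {} } };
}

// Tool names follow the post; the persona-to-tools mapping lives only on the server.
const TOOLS_BY_PERSONA: Record<Persona, ToolDefinition[]> = {
  capture: [
    tool('save_insight', 'Persist an insight the user just shared.'),
    tool('get_session_context', 'Return context for the current session.'),
  ],
  assistant: [
    tool('retrieve_knowledge', 'Search saved knowledge for relevant chunks.'),
    tool('get_session_context', 'Return context for the current session.'),
  ],
};

function buildSessionBody(user: AuthenticatedUser, persona: Persona, deployment: string) {
  return {
    session: {
      type: 'realtime',
      model: deployment,
      // User context is injected here, so the browser never sees the raw prompt.
      instructions: `You are the ${persona} agent. You are speaking with ${user.name} (${user.role}).`,
      tools: TOOLS_BY_PERSONA[persona],
      tool_choice: 'auto', // 'required' led to tool-call loops, as noted above
      audio: { output: { voice: 'cedar' } },
    },
  };
}
```

The returned object would be passed to JSON.stringify as the body of the client_secrets request, so the persona decision never has to be trusted to the browser.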
Turn detection and the speaking state machine
Voice turn detection is where most agents break in the real world. Azure offers two modes:
- server_vad: volume-threshold based voice activity detection. Fast but will fire on background noise, TV audio, keyboard clicks.
- semantic_vad: model-aware turn detection that understands sentence boundaries. Slower but dramatically fewer false positives.
In the Memory Agent, which ran in noisy open-office environments, semantic VAD was the correct choice despite the latency cost. In the Visit Agent, which ran structured interviews in quieter settings, server VAD worked fine.
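A sketch of how that choice might be encoded in the session body, assuming turn detection sits at audio.input.turn_detection in the token request. The helper names, the environment labels, and the numeric values are illustrative assumptions rather than the settings either agent shipped with.

```typescript
type TurnDetection =
  | { type: 'server_vad'; threshold: number; prefix_padding_ms: number; silence_duration_ms: number }
  | { type: 'semantic_vad'; eagerness: 'low' | 'medium' | 'high' | 'auto' };

type Environment = 'noisy' | 'quiet';

function turnDetectionFor(environment: Environment): TurnDetection {
  if (environment === 'noisy') {
    // Model-aware detection: slower, but far fewer false triggers from background sound.
    return { type: 'semantic_vad', eagerness: 'auto' };
  }
  // Volume-threshold detection: fast, suited to quiet rooms. Values are illustrative.
  return { type: 'server_vad', threshold: 0.5, prefix_padding_ms: 300, silence_duration_ms: 500 };
}

// Builds the audio.input portion of the session body.
function audioInputConfig(environment: Environment) {
  return {
    transcription: { model: 'whisper-1', language: 'en' },
    turn_detection: turnDetectionFor(environment),
  };
}
```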
The state machine that matters is not in Azure — it is in your client code:
type VoiceAgentState =
  | 'idle'
  | 'connecting'
  | 'connected'
  | 'listening'  // user can speak, mic is active
  | 'thinking'   // model received audio, processing
  | 'speaking'   // model is playing audio output
  | 'error';
Transitions you must handle explicitly:
- listening → thinking: triggered by input_audio_buffer.speech_stopped
- thinking → speaking: triggered by output_audio_buffer.started
- speaking → listening: triggered by output_audio_buffer.stopped
- Any state → listening: on user interrupt (input_audio_buffer.speech_started while model is speaking)
The interrupt case is the one that kills demos. When a user starts speaking while the model is talking, you need to cancel the current audio output immediately and start processing the new input. Azure handles this on its end — but your UI must reflect it instantly or users think the agent is broken.
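The transitions above can be sketched as a pure function that maps the current state and an incoming event type to the next state. This is a minimal sketch: nextState and the ServerEventType union are hypothetical names, and ignoring audio events before the session is live is an assumption, not something the API mandates.

```typescript
type VoiceAgentState =
  | 'idle' | 'connecting' | 'connected'
  | 'listening' | 'thinking' | 'speaking' | 'error';

type ServerEventType =
  | 'input_audio_buffer.speech_started'
  | 'input_audio_buffer.speech_stopped'
  | 'output_audio_buffer.started'
  | 'output_audio_buffer.stopped';

function nextState(current: VoiceAgentState, event: ServerEventType): VoiceAgentState {
  // Assumption: audio events are ignored until the session is connected.
  if (current === 'idle' || current === 'connecting' || current === 'error') {
    return current;
  }
  if (event === 'input_audio_buffer.speech_started') {
    // Covers the interrupt case: the user speaks while the model is talking.
    return 'listening';
  }
  if (event === 'input_audio_buffer.speech_stopped' && current === 'listening') {
    return 'thinking';
  }
  if (event === 'output_audio_buffer.started' && current === 'thinking') {
    return 'speaking';
  }
  if (event === 'output_audio_buffer.stopped' && current === 'speaking') {
    return 'listening';
  }
  return current; // any other combination leaves the state unchanged
}
```

Keeping the function pure makes it easy to unit test and to call from a data channel handler, for example state = nextState(state, event.type), with the UI update following in the same tick.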
Function calling over the data channel
Tool calls in the Realtime API are asynchronous events on the data channel, not synchronous returns. The sequence:
- Model emits response.function_call_arguments.done with call_id, name, arguments
- You execute the tool handler (can be async, can hit a database)
- You send back a conversation.item.create event with the tool result
- You send response.create to trigger the model's next response
In the Memory Agent, tool execution involved pgvector similarity search against hundreds of knowledge chunks. The round-trip for retrieve_knowledge averaged 280ms on warm connections. This is fine — the model holds state and waits. But you need to handle errors explicitly: if your tool handler throws, you must still send a result event back, or the model hangs waiting for a response that never comes.
client.on('response.function_call_arguments.done', async (event) => {
  let result: string;
  try {
    result = await executeToolHandler(event.name, event.arguments);
  } catch (err) {
    result = JSON.stringify({ error: 'Tool execution failed', details: String(err) });
  }
  // Always send the result back, even on error
  client.send({
    type: 'conversation.item.create',
    item: {
      type: 'function_call_output',
      call_id: event.call_id,
      output: result,
    },
  });
  client.send({ type: 'response.create' });
});
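The handler above calls executeToolHandler without showing it. One possible shape, sketched under assumptions, is a registry keyed by tool name that parses the JSON argument string and serializes the result. The registry, the ToolHandler type, and the stubbed get_session_context body are hypothetical.

```typescript
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

const toolHandlers = new Map<string, ToolHandler>();

// Stub for illustration; a real handler would query a database.
toolHandlers.set('get_session_context', async (args) => ({
  sessionId: args.sessionId ?? null,
  topics: [],
}));

async function executeToolHandler(name: string, rawArguments: string): Promise<string> {
  const handler = toolHandlers.get(name);
  if (!handler) {
    throw new Error(`Unknown tool: ${name}`);
  }
  // Arguments arrive as a JSON string; an empty string is treated as no arguments.
  const args = JSON.parse(rawArguments || '{}') as Record<string, unknown>;
  const result = await handler(args);
  return JSON.stringify(result);
}
```

Because this version throws on unknown tools and malformed JSON, the try/catch in the event handler turns both cases into an error result the model can read, rather than leaving it waiting.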
The transcript timing problem
This tripped me up on the Visit Agent frontend. There are two separate transcript streams:
- response.audio_transcript.delta: transcript of what the model is about to say
- conversation.item.input_audio_transcription.completed: transcript of what the user said
The model's transcript arrives before the audio plays. If you render it immediately, users see the agent's words appearing on screen before they hear them — which is uncanny and undermines trust in the voice interaction.
The fix is to buffer the transcript and "drip" it in sync with audio playback:
let wordBuffer: string[] = [];
let dripTimer: ReturnType<typeof setInterval> | null = null;

client.on('output_audio_buffer.started', () => {
  // Clear any timer left over from a previous response so intervals never stack
  if (dripTimer) clearInterval(dripTimer);
  // ~3.5 words/sec = natural speech cadence at cedar voice speed
  dripTimer = setInterval(() => {
    const word = wordBuffer.shift();
    if (word) appendToDisplay(word);
  }, 285);
});

client.on('output_audio_buffer.stopped', () => {
  // Flush remaining buffer instantly when audio ends
  if (dripTimer) clearInterval(dripTimer);
  dripTimer = null;
  if (wordBuffer.length > 0) appendToDisplay(wordBuffer.join(''));
  wordBuffer = [];
});

client.on('response.audio_transcript.delta', ({ delta }) => {
  wordBuffer.push(delta); // buffer, do not display yet
});
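One gap in the snippet above is the interrupt case: if the user cuts the model off, flushing the buffer would display words that were never spoken. A hedged sketch of one way to handle it is to wrap the buffer in a small class with a separate discard path. TranscriptDripper is a hypothetical name, and wiring discard to the output_audio_buffer.cleared event is an assumption about how the interruption is surfaced to the client.

```typescript
class TranscriptDripper {
  private buffer: string[] = [];
  private timer: ReturnType<typeof setInterval> | null = null;
  private readonly display: (text: string) => void;
  private readonly intervalMs: number;

  constructor(display: (text: string) => void, intervalMs = 285) {
    this.display = display;
    this.intervalMs = intervalMs;
  }

  // Buffer a transcript delta without showing it yet.
  push(delta: string): void {
    this.buffer.push(delta);
  }

  // Begin releasing one buffered chunk per interval.
  start(): void {
    this.stopTimer();
    this.timer = setInterval(() => this.tick(), this.intervalMs);
  }

  // Release a single chunk; exposed so it can be driven manually in tests.
  tick(): void {
    const chunk = this.buffer.shift();
    if (chunk) this.display(chunk);
  }

  // Audio finished normally: show whatever is left.
  flush(): void {
    this.stopTimer();
    if (this.buffer.length > 0) this.display(this.buffer.join(''));
    this.buffer = [];
  }

  // Audio was interrupted: drop text that was never spoken.
  discard(): void {
    this.stopTimer();
    this.buffer = [];
  }

  private stopTimer(): void {
    if (this.timer !== null) clearInterval(this.timer);
    this.timer = null;
  }
}

// Wiring sketch (client and appendToDisplay come from the snippet above):
// const dripper = new TranscriptDripper(appendToDisplay);
// client.on('response.audio_transcript.delta', ({ delta }) => dripper.push(delta));
// client.on('output_audio_buffer.started', () => dripper.start());
// client.on('output_audio_buffer.stopped', () => dripper.flush());
// client.on('output_audio_buffer.cleared', () => dripper.discard());
```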
Session lifecycle and post-call processing
Both the Memory Agent and Visit Agent needed substantial post-call processing: session summaries, knowledge graph extraction, actionable item generation, vector indexing. The pattern that worked was background job queues triggered on session end, with atomic database flags to prevent duplicate processing:
// Atomically claim the summarization job
const log = await db.voiceSession.findOneAndUpdate(
  {
    sessionId,
    summaryGenerating: { $ne: true },
    summaryGenerated: { $ne: true },
  },
  { $set: { summaryGenerating: true } },
  { returnDocument: 'after' }
);
if (!log) {
  // Another worker already claimed it, so skip
  return;
}
This pattern prevents the most common production bug with voice agents: a user clicking "End Call" twice (or a network retry) causing two summarization jobs to race, double-saving insights, or generating conflicting knowledge graph nodes.
In the Visit Agent, the post-call pipeline was:
- Generate markdown summary from full transcript (gpt-4o-mini, temperature 0.3)
- Extract actionable items as structured JSON (gpt-4o-mini with response_format)
- Index the voice log to a vector store for semantic search across sessions
- Create a thread with the transcript for human review
Each step was idempotent and had its own _generating / _generated flag pair. The pipeline could fail at any step and be safely re-triggered without corrupting state.
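The flag-pair idea can be sketched with an in-memory store standing in for the database. In this sketch claimStep plays the role of the atomic findOneAndUpdate shown earlier, and runStep releases the generating flag on failure so the step can be re-triggered. All names here are hypothetical, and the in-memory Map is only safe because the example runs in a single process; real atomicity has to come from the database.

```typescript
interface StepFlags {
  generating: boolean;
  generated: boolean;
}

type FlagStore = Map<string, StepFlags>;

// Stand-in for an atomic compare-and-set in the database.
function claimStep(store: FlagStore, step: string): boolean {
  const flags = store.get(step) ?? { generating: false, generated: false };
  if (flags.generating || flags.generated) return false;
  store.set(step, { ...flags, generating: true });
  return true;
}

async function runStep(
  store: FlagStore,
  step: string,
  work: () => Promise<void>,
): Promise<'ran' | 'skipped'> {
  if (!claimStep(store, step)) return 'skipped';
  try {
    await work();
    store.set(step, { generating: false, generated: true });
    return 'ran';
  } catch (err) {
    // Release the claim so a later retry can pick the step up again.
    store.set(step, { generating: false, generated: false });
    throw err;
  }
}
```

A pipeline would call runStep once per stage with its own key (for example 'summary', 'actions', 'index', 'thread'), so re-triggering the whole pipeline skips the stages that already completed.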
What the azure-realtime-webrtc package abstracts
After building the Memory Agent and Visit Agent from scratch — raw WebRTC SDP negotiation, data channel event parsing, audio stream management, the full state machine — I extracted everything reusable into azure-realtime-webrtc.
The package has four entry points:
| Entry point | What it gives you |
|---|---|
| azure-realtime-webrtc | Low-level RealtimeClient with typed events for all 32+ server events |
| azure-realtime-webrtc/sdk | High-level VoiceAssistant, TextChat, ToolAgent classes |
| azure-realtime-webrtc/streaming | Async iterators, ReadableStream, SSE helpers |
| azure-realtime-webrtc/server | Express middleware for token server, SDP proxy, rate limiting |
The minimal version:
import { VoiceAssistant } from 'azure-realtime-webrtc/sdk';

const assistant = new VoiceAssistant({
  resource: 'my-azure-resource',
  deployment: 'gpt-4o-realtime-preview',
  tokenProvider: () =>
    fetch('/api/realtime/token', { method: 'POST' })
      .then(r => r.json())
      .then(d => d.token),
  instructions: 'You are a helpful assistant.',
  voice: 'alloy',
});

assistant.on('transcript', (entries) => renderConversation(entries));
assistant.on('stateChange', (state) => updateUI(state));
await assistant.start();
The React hook pattern I use in every project now:
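import { useCallback, useEffect, useRef, useState } from 'react';
// Assumes the SDK entry point exports these types alongside VoiceAssistant
import { VoiceAssistant } from 'azure-realtime-webrtc/sdk';
import type { VoiceAssistantState, TranscriptEntry } from 'azure-realtime-webrtc/sdk';

export function useVoiceAssistant() {
  const ref = useRef<VoiceAssistant | null>(null);
  const [state, setState] = useState<VoiceAssistantState>('idle');
  const [transcript, setTranscript] = useState<TranscriptEntry[]>([]);

  const start = useCallback(async () => {
    // Stop any previous instance so a second start() cannot leak a live session
    ref.current?.stop();
    const a = new VoiceAssistant({ /* config */ });
    a.on('stateChange', setState);
    a.on('transcript', setTranscript);
    ref.current = a;
    await a.start();
  }, []);

  const stop = useCallback(() => {
    ref.current?.stop();
    ref.current = null;
  }, []);

  // Tear down the session when the component unmounts
  useEffect(() => () => { ref.current?.stop(); }, []);

  return { state, transcript, start, stop };
}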
Things I wish I knew on day one
1. East US 2 and Sweden Central only. At the time of writing, Realtime deployments only work in these two Azure regions. If you deploy your GPT-4o-realtime model in any other region and wonder why SDP negotiation silently fails, now you know.
2. Voice config nesting is not flat. The token POST body has audio.output.voice, audio.input.transcription.model, audio.input.turn_detection. Not voice, transcriptionModel, turnDetection at the top level. Getting this wrong gives you a 500 with no helpful error message.
3. Both transcript event names. Azure emits both response.audio_transcript.delta and response.output_audio_transcript.delta depending on context. Listen to both or you will miss chunks.
4. WebSocket mode for server-side. RTCPeerConnection is not available in Node.js. Use mode: "websocket" for server-side agent execution.
5. The data channel has a 10-second open timeout. If the channel does not open within that window, the session is broken. This usually means the deployment is in the wrong region or the SDP negotiation failed silently. Add explicit timeout handling.
6. Knowledge retrieval must be mandatory, not optional. In the Memory Agent, I put "MANDATORY — call this tool SILENTLY AND IMMEDIATELY when the user asks any question" in the tool description itself. tool_choice: "required" breaks the agent; strongly-worded tool descriptions actually work.
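Points 3 and 5 lend themselves to small helpers. The sketch below subscribes one handler to both transcript event names and wraps any promise in a timeout. DeltaEventSource is a hypothetical minimal interface rather than the SDK's actual client type, and waitForDataChannelOpen in the usage comment is likewise a placeholder.

```typescript
interface DeltaEventSource {
  on(event: string, handler: (payload: { delta: string }) => void): void;
}

const TRANSCRIPT_DELTA_EVENTS = [
  'response.audio_transcript.delta',
  'response.output_audio_transcript.delta',
];

// Register the same handler under both event names so no chunk is missed.
function onTranscriptDelta(client: DeltaEventSource, handler: (delta: string) => void): void {
  for (const name of TRANSCRIPT_DELTA_EVENTS) {
    client.on(name, ({ delta }) => handler(delta));
  }
}

// Reject if the wrapped promise has not settled within ms milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Usage sketch: await withTimeout(waitForDataChannelOpen(), 10_000, 'data channel open');
```

Surfacing the timeout as a labeled error gives the UI something concrete to show, instead of a session that silently never starts.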
What is next
The next hard problem in voice agents is multi-turn memory across sessions. The Memory Agent handles this through vector search — every saved insight is embedded and retrievable in future sessions. The Visit Agent indexes voice logs so subsequent visits can surface relevant history. Neither approach is ideal because embedding quality degrades on short, spoken utterances versus written prose.
The better answer is probably a hybrid: dense retrieval for factual recall, sparse BM25 for exact term matching, and a lightweight graph for relationship traversal. More on that in a future post.
If you are building on Azure OpenAI Realtime, the package is at npmjs.com/package/azure-realtime-webrtc. Issues and contributions welcome.