Strongly Certified · Streaming Workflow

Realtime Voice Agent

Audio-in / audio-out single-node alternative to STT + LLM + TTS.

Browser WebSocket voice agent backed by a full-duplex realtime LLM (OpenAI Realtime, Gemini Live). One model handles VAD, transcription, reasoning, and voice synthesis as one inseparable pipeline. Steerable mid-session via SteerFrame; server-VAD default; per-session S3 archive of audio + transcript.

≤800ms · First audio response (p95)
≤2.5s · Turn complete (p95)
server VAD · Default turn detection

The voice loop, end-to-end.

No black box. Each step is a typed-frame node you can edit, monitor, and replace.

01

Browser opens a WebSocket; streaming-realtime-agent opens a second WebSocket through the AI Gateway's realtime relay to the configured provider (OpenAI Realtime by default).

02

Inbound AudioFrames stream through input_audio_buffer.append; the provider does VAD + transcription + reasoning + voice synthesis as one model.

03

Outbound response.audio.delta events become AudioFrames at the configured output sample rate; the streaming-recorder taps audio + transcript on its way out for the per-session S3 archive.

04

InterruptFrame on the trigger sends response.cancel to the provider for barge-in. SteerFrame mutates voice / instructions / temperature mid-session without ending the conversation.
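
The four steps above map onto a small set of provider events. A minimal Python sketch, assuming the OpenAI Realtime JSON event shapes (base64 audio carried in the `audio` field of `input_audio_buffer.append` and the `delta` field of `response.audio.delta`). The helper names `audio_frame_to_event`, `event_to_audio_frame`, and `interrupt_to_event` are hypothetical and are not the node's actual API.

```python
import base64
import json
from typing import Optional


def audio_frame_to_event(pcm: bytes) -> str:
    """Step 02: wrap inbound PCM bytes in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def event_to_audio_frame(raw: str) -> Optional[bytes]:
    """Step 03: decode a response.audio.delta event; ignore other event types."""
    event = json.loads(raw)
    if event.get("type") != "response.audio.delta":
        return None
    return base64.b64decode(event["delta"])


def interrupt_to_event() -> str:
    """Step 04: barge-in cancels the in-flight response."""
    return json.dumps({"type": "response.cancel"})
```

In a real deployment these strings would travel over the second WebSocket opened in step 01; the sketch leaves the transport out so the event mapping stays visible.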

Built for production. Day Two-ready.

Streaming graph contract, observability, and cost discipline come standard. The agent ships with a full test suite that runs in CI on every node version bump.

Single-node voice loop

Replaces VAD + STT + LLM + TTS with one realtime model. Lower latency, fewer billing lines, fewer integration points. The standard chain stays available for mix-and-match deployments.

aiortc-free · Server VAD · One billing line

Steerable mid-session

SteerFrame on steer_in mutates the live session via session.update. Voice + instructions swaps defer until response.done so providers don't reject mid-response changes; temperature + turn-detection swaps apply immediately.

SteerFrame · Live mutation · No session restart
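
The deferral rule for steering can be sketched as a small queue. This is an illustration under stated assumptions, not the node's implementation: the `SteerQueue` class and its fields are hypothetical, the `sent` list stands in for the provider WebSocket, and the sketch assumes deferred fields apply immediately when no response is in flight.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Fields that wait for response.done; everything else applies immediately.
DEFERRED_KEYS = {"voice", "instructions"}


@dataclass
class SteerQueue:
    """Hypothetical helper: decides when a steer field reaches session.update."""
    response_active: bool = False
    pending: Dict[str, Any] = field(default_factory=dict)
    sent: List[Dict[str, Any]] = field(default_factory=list)  # stand-in for the socket

    def _send_update(self, fields: Dict[str, Any]) -> None:
        if fields:
            self.sent.append({"type": "session.update", "session": fields})

    def steer(self, fields: Dict[str, Any]) -> None:
        """Apply a SteerFrame: hold voice/instructions while a response is active."""
        immediate = {
            k: v for k, v in fields.items()
            if not (self.response_active and k in DEFERRED_KEYS)
        }
        deferred = {k: v for k, v in fields.items() if k not in immediate}
        self.pending.update(deferred)
        self._send_update(immediate)

    def on_event(self, event_type: str) -> None:
        """Track response lifecycle and flush held fields on response.done."""
        if event_type == "response.created":
            self.response_active = True
        elif event_type == "response.done":
            self.response_active = False
            self._send_update(self.pending)
            self.pending = {}
```

A steer that arrives mid-response with both a voice and a temperature therefore produces two `session.update` messages: temperature right away, voice once the response completes.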

Audio-only or audio+text

The modalities config switches between audio (no transcript-delta emission) and audio_text (transcript on text_out). Audio-only mode saves provider tokens when no downstream node consumes the transcript.

modalities config · Optional transcript · Token-conscious

Provider-agnostic config

OpenAI Realtime and Gemini Live are both supported through the same node via STRONGLY_SERVICES routing. provider_param_overrides is an escape hatch for new provider knobs, so you can adopt them without forking the node.

AI Gateway routing · Per-provider tuning · Forward-compat

Same span path as batch

Per-frame spans land in workflow_spans (Mongo), same path the batch runtime uses, same canvas viewer. No Prometheus, no Grafana, no external tracing.

ADR-S14 · Strongly tracing · Canvas-first

Typed bridge failures

Bridge failures surface as typed errors: RealtimeAuthError, RealtimeProtocolError, RealtimeAudioFormatError, RealtimeProviderError, RealtimeReadTimeout. Each is routed to the auto-injected streaming-errors sink so provider failures never block the voice loop.

Canonical error_type · ErrorFrame substrate · Voice loop never blocks
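
One way to picture the typed-failure routing is a shared base class plus a guard that converts exceptions into frames. This is a sketch only: the base class `RealtimeBridgeError`, the `to_error_frame` and `guard` helpers, and the ErrorFrame field names are assumptions for illustration, and a plain list stands in for the streaming-errors sink.

```python
from typing import Any, Callable, Dict, List, Optional


class RealtimeBridgeError(Exception):
    """Hypothetical common base for the typed bridge failures."""


class RealtimeAuthError(RealtimeBridgeError): ...
class RealtimeProtocolError(RealtimeBridgeError): ...
class RealtimeAudioFormatError(RealtimeBridgeError): ...
class RealtimeProviderError(RealtimeBridgeError): ...
class RealtimeReadTimeout(RealtimeBridgeError): ...


def to_error_frame(exc: RealtimeBridgeError, session_id: str) -> Dict[str, Any]:
    """Build an ErrorFrame-like dict; the class name doubles as error_type."""
    return {
        "frame": "ErrorFrame",
        "error_type": type(exc).__name__,
        "message": str(exc),
        "session_id": session_id,
    }


def guard(step: Callable[[], Any], sink: List[Dict[str, Any]],
          session_id: str) -> Optional[Any]:
    """Run one bridge step; route typed failures to the sink instead of raising."""
    try:
        return step()
    except RealtimeBridgeError as exc:
        sink.append(to_error_frame(exc, session_id))
        return None
```

Because `guard` swallows only the typed bridge errors, unexpected exceptions still propagate, while provider failures end up in the sink and the loop continues.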

Real services. Your stack.

Every dependency is a registered Strongly service or a model you control. Swap any one of them in the install wizard. The graph stays intact.

Realtime node
streaming-realtime-agent 1.0.0 - aiohttp WebSocket through the AI Gateway
Realtime model
OpenAI Realtime (gpt-4o-realtime-preview) by default; Gemini Live also supported
Recorder
streaming-recorder 1.4.0 - audio.wav + transcript.json + manifest.json + spans.json on EndFrame
Strongly tracing
Per-frame spans on workflow_spans (Mongo) - same path as the batch runtime

Tune it. Don't fork it.

The marketplace template is the graph. Every customisation below is a config change or a single-node addition - never a rewrite.

Voice swap

Edit rt.config.voice or send a SteerFrame mid-session. Voice swaps defer until the current response finishes.

Different system prompt

rt.config.system_prompt sets the initial instructions. Use SteerFrame.instructions for runtime mutations.

Audio-only modality

Set rt.config.modalities to audio to skip transcript-delta emission and save tokens.

Tighter audio formats

Default is pcm16_16k on both sides. Switch to pcm16_24k for higher fidelity, or g711_mulaw for telephony bridges.
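
To see what the format choice means on the wire, here is a rough calculation of bytes per audio chunk. It assumes mono audio, 2 bytes per sample for pcm16, and the standard 8 kHz, 1 byte per sample for G.711 mu-law; the `FORMATS` table, the `bytes_per_chunk` helper, and the 20 ms chunk size are illustrative choices, not values from the node.

```python
# (sample_rate_hz, bytes_per_sample) per format name; mono assumed.
FORMATS = {
    "pcm16_16k": (16_000, 2),
    "pcm16_24k": (24_000, 2),
    "g711_mulaw": (8_000, 1),  # assumption: standard telephony rate
}


def bytes_per_chunk(fmt: str, chunk_ms: int = 20) -> int:
    """Raw payload size of one chunk of audio in the given format."""
    sample_rate, bytes_per_sample = FORMATS[fmt]
    return sample_rate * bytes_per_sample * chunk_ms // 1000


for name in FORMATS:
    print(name, bytes_per_chunk(name))
# pcm16_16k 640
# pcm16_24k 960
# g711_mulaw 160
```

Under these assumptions, pcm16_24k costs 50% more bandwidth than the default, and g711_mulaw a quarter of it.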

Provider param overrides

rt.config.provider_param_overrides is a dict merged into the initial session.update for OpenAI-Realtime-specific knobs (max_response_output_tokens, etc.).
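
The override merge can be sketched as a plain dict update. This assumes a shallow merge in which override keys win over the node's own defaults; the `build_session_update` helper and the shape of the `config` dict are hypothetical, while `voice`, `system_prompt`, and `provider_param_overrides` are the config keys named above.

```python
from typing import Any, Dict


def build_session_update(config: Dict[str, Any]) -> Dict[str, Any]:
    """Compose the initial session.update from node config plus overrides."""
    session: Dict[str, Any] = {
        "voice": config["voice"],
        "instructions": config["system_prompt"],
    }
    # Assumption: shallow merge, provider-specific overrides applied last.
    session.update(config.get("provider_param_overrides", {}))
    return {"type": "session.update", "session": session}


event = build_session_update({
    "voice": "alloy",
    "system_prompt": "You are a concise support agent.",
    "provider_param_overrides": {"max_response_output_tokens": 512},
})
```

Keeping the overrides as an opaque dict is what lets a new provider knob pass through without a node release.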

Production. Not pilots.

We don't leave until it runs. Talk to a forward-deployed engineer about deploying Realtime Voice Agent into your environment with your STT, your LLM, your TTS, your data.

Schedule a Demo