Audio-in / audio-out single-node alternative to STT + LLM + TTS.
Browser WebSocket voice agent backed by a full-duplex realtime LLM (OpenAI Realtime, Gemini Live). One model handles VAD, transcription, reasoning, and voice synthesis as one inseparable pipeline. Steerable mid-session via SteerFrame; server-VAD default; per-session S3 archive of audio + transcript.
No black box. Each step is a typed-frame node you can edit, monitor, and replace.
Browser opens a WebSocket; streaming-realtime-agent opens a second WebSocket through the AI Gateway's realtime relay to the configured provider (OpenAI Realtime by default).
Inbound AudioFrames stream through input_audio_buffer.append; the provider does VAD + transcription + reasoning + voice synthesis as one model.
Outbound response.audio.delta events become AudioFrames at the configured output sample rate; the streaming-recorder taps audio + transcript on its way out for the per-session S3 archive.
InterruptFrame on the trigger sends response.cancel to the provider for barge-in. SteerFrame mutates voice / instructions / temperature mid-session without ending the conversation.
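The frame-to-event translation described in the steps above can be sketched roughly as follows. The event names (input_audio_buffer.append, response.cancel, response.audio.delta) come from the text and follow the OpenAI Realtime event format, which carries audio as base64. The AudioFrame and InterruptFrame classes and both helper functions are hypothetical stand-ins for illustration, not the node's actual types.

```python
import base64
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class AudioFrame:
    """Hypothetical frame type: raw audio bytes plus their sample rate."""
    pcm: bytes
    sample_rate: int


@dataclass
class InterruptFrame:
    """Hypothetical frame type signalling barge-in."""
    reason: str = "barge_in"


def to_provider_event(frame: Union[AudioFrame, InterruptFrame]) -> dict:
    """Translate an inbound frame into a provider client event."""
    if isinstance(frame, AudioFrame):
        # Audio is sent to the provider as base64 text inside a JSON event.
        return {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(frame.pcm).decode("ascii"),
        }
    if isinstance(frame, InterruptFrame):
        # Barge-in: ask the provider to stop the response in progress.
        return {"type": "response.cancel"}
    raise TypeError(f"unsupported frame type: {type(frame).__name__}")


def from_provider_event(event: dict, output_sample_rate: int) -> Optional[AudioFrame]:
    """Turn a response.audio.delta event into an AudioFrame; ignore other events."""
    if event.get("type") == "response.audio.delta":
        return AudioFrame(
            pcm=base64.b64decode(event["delta"]),
            sample_rate=output_sample_rate,
        )
    return None
```

In this sketch each event dict would be serialised with json.dumps before being written to the provider WebSocket; that transport step is left out to keep the example self-contained.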
Streaming graph contract, observability, and cost discipline come standard. The agent ships with a full test suite that runs in CI on every node version bump.
Replaces VAD + STT + LLM + TTS with one realtime model. Lower latency, fewer billing lines, fewer integration points. The standard chain stays available for mix-and-match deployments.
A SteerFrame on steer_in mutates the live session via session.update. Voice and instructions swaps are deferred until response.done so providers don't reject mid-response changes; temperature and turn-detection swaps apply immediately.
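The deferral rule can be illustrated with a small buffer that holds voice and instructions changes while a response is in flight and flushes them on response.done. This is a minimal sketch, not the node's implementation: the SteerBuffer class and its method names are hypothetical, and sending deferred fields straight away when no response is active is an assumption the text does not confirm.

```python
from typing import Any, Dict, List

# Fields the text says are deferred until response.done.
DEFERRED_FIELDS = frozenset({"voice", "instructions"})


def session_update(fields: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap changed session fields in a session.update client event."""
    return {"type": "session.update", "session": dict(fields)}


class SteerBuffer:
    """Hypothetical helper deciding which steer changes go out now and which wait."""

    def __init__(self) -> None:
        self.response_active = False
        self.pending: Dict[str, Any] = {}

    def on_steer(self, changes: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Return the events to send now; hold deferred fields during a response."""
        immediate: Dict[str, Any] = {}
        for key, value in changes.items():
            if key in DEFERRED_FIELDS and self.response_active:
                self.pending[key] = value  # wait for response.done
            else:
                immediate[key] = value  # e.g. temperature, turn_detection
        return [session_update(immediate)] if immediate else []

    def on_provider_event(self, event: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Track response lifecycle and flush pending changes on response.done."""
        kind = event.get("type")
        if kind == "response.created":
            self.response_active = True
        elif kind == "response.done":
            self.response_active = False
            if self.pending:
                flushed, self.pending = self.pending, {}
                return [session_update(flushed)]
        return []
```

A later SteerFrame for the same field overwrites the pending value, so only the most recent voice or instructions change is sent once the response finishes.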
The modalities config switches between audio (no transcript-delta emission) and audio_text (transcript on text_out). Choosing audio saves provider tokens when no downstream node consumes the transcript.
OpenAI Realtime and Gemini Live are both supported through the same node via STRONGLY_SERVICES routing. provider_param_overrides is an escape hatch for new provider knobs, so you don't have to fork the node.
Per-frame spans land in workflow_spans (Mongo), same path the batch runtime uses, same canvas viewer. No Prometheus, no Grafana, no external tracing.
Typed errors: RealtimeAuthError, RealtimeProtocolError, RealtimeAudioFormatError, RealtimeProviderError, RealtimeReadTimeout. All are routed to the auto-injected streaming-errors sink so provider failures never block the voice loop.
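One way to picture the routing is a wrapper that catches the typed errors and appends a record to a sink instead of letting the exception escape. The five exception names come from the text. The shared RealtimeError base class, the ErrorFrame record, and the run_step helper are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


class RealtimeError(Exception):
    """Hypothetical common base; the text only lists the five subclasses."""


class RealtimeAuthError(RealtimeError):
    pass


class RealtimeProtocolError(RealtimeError):
    pass


class RealtimeAudioFormatError(RealtimeError):
    pass


class RealtimeProviderError(RealtimeError):
    pass


class RealtimeReadTimeout(RealtimeError):
    pass


@dataclass
class ErrorFrame:
    """Hypothetical record delivered to the errors sink."""
    kind: str
    message: str


def run_step(step: Callable[[], Any], error_sink: List[ErrorFrame]) -> Optional[Any]:
    """Run one processing step; divert typed realtime errors to the sink."""
    try:
        return step()
    except RealtimeError as exc:
        # The failure is recorded, and the caller keeps processing frames.
        error_sink.append(ErrorFrame(kind=type(exc).__name__, message=str(exc)))
        return None
```

Only the typed realtime errors are diverted here; any other exception still propagates, which keeps genuine programming bugs visible.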
Every dependency is a registered Strongly service or a model you control. Swap any one of them in the install wizard. The graph stays intact.
The marketplace template is the graph. Every customisation below is a config change or a single-node addition - never a rewrite.
Edit rt.config.voice or send a SteerFrame mid-session. Voice swaps defer until the current response finishes.
rt.config.system_prompt sets the initial instructions. Use SteerFrame.instructions for runtime mutations.
Set rt.config.modalities to audio to skip transcript-delta emission and save tokens.
The default audio format is pcm16_16k for both input and output. Switch to pcm16_24k for higher fidelity, or g711_mulaw for telephony bridges.
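To make the trade-off between the three formats concrete, the sketch below computes how many bytes a fixed-duration audio chunk occupies in each. It assumes mono audio, 16-bit samples for the pcm16 formats, and standard G.711 mu-law at 8 kHz with one byte per sample. The AudioFormat class, the FORMATS table, and chunk_bytes are hypothetical helpers, not part of the node's config schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFormat:
    sample_rate_hz: int
    bytes_per_sample: int


# Format names are from the text; the parameters are assumed (mono).
FORMATS = {
    "pcm16_16k": AudioFormat(sample_rate_hz=16_000, bytes_per_sample=2),
    "pcm16_24k": AudioFormat(sample_rate_hz=24_000, bytes_per_sample=2),
    "g711_mulaw": AudioFormat(sample_rate_hz=8_000, bytes_per_sample=1),
}


def chunk_bytes(format_name: str, chunk_ms: int = 20) -> int:
    """Size in bytes of one mono chunk of chunk_ms milliseconds."""
    fmt = FORMATS[format_name]
    return fmt.sample_rate_hz * fmt.bytes_per_sample * chunk_ms // 1000


if __name__ == "__main__":
    for name in FORMATS:
        print(name, chunk_bytes(name))
```

Under these assumptions a 20 ms chunk is 640 bytes in pcm16_16k, 960 bytes in pcm16_24k, and 160 bytes in g711_mulaw.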
rt.config.provider_param_overrides is a dict merged into the initial session.update for OpenAI-Realtime-specific knobs (max_response_output_tokens, etc.).
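The merge could look something like the sketch below, which builds the initial session.update from the config keys named above. The build_initial_session_update function is hypothetical, and two behaviours are assumptions: the merge is shallow, and override keys win over the node's own values on conflict.

```python
from typing import Any, Dict


def build_initial_session_update(config: Dict[str, Any]) -> Dict[str, Any]:
    """Build the first session.update event from node config (hypothetical helper)."""
    session: Dict[str, Any] = {
        "voice": config["voice"],
        # The text says system_prompt sets the initial instructions.
        "instructions": config["system_prompt"],
    }
    # Assumed: shallow merge, with override keys taking precedence.
    session.update(config.get("provider_param_overrides") or {})
    return {"type": "session.update", "session": session}


if __name__ == "__main__":
    example_config = {
        "voice": "alloy",  # example value only
        "system_prompt": "Answer briefly.",
        "provider_param_overrides": {"max_response_output_tokens": 512},
    }
    print(build_initial_session_update(example_config))
```

Keeping the overrides as an opaque dict means a new provider option can be passed through without the node needing to know about it.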
We don't leave until it runs. Talk to a forward-deployed engineer about deploying Realtime Voice Agent into your environment with your STT, your LLM, your TTS, your data.