Browser WebRTC voice + video in. Audio loop on the wire, video archived to S3.
Browser-side WebRTC carries audio and camera frames to the agent over DTLS+SRTP. The standard streaming voice loop runs on the audio; ImageFrames are sampled at 1 fps by default and passed to the recorder, which writes per-frame JPEGs alongside audio.wav and transcript.json at session end.
No black box. Each step is a typed-frame node you can edit, monitor, and replace.
Browser opens an RTCPeerConnection and sends an SDP offer over the same session WebSocket the runtime already accepts.
streaming-webrtc-trigger answers, negotiates ICE, and decodes inbound media. Opus audio is decoded and ratecv-downsampled to 16 kHz PCM AudioFrames; VP8/VP9/H.264 video is decoded to RGB24 ImageFrames at the configured video_sample_hz.
Standard voice loop: VAD, STT, turn-detection, conversation-memory, LLM, TTS. The agent's audio reply rides back over the streaming WebSocket as 16 kHz PCM.
streaming-recorder buffers audio, transcript, and ImageFrames; on EndFrame writes audio.wav, transcript.json, frames/
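The audio and transcript half of that end-of-session write can be sketched with the Python standard library alone. This is a minimal illustration, not the streaming-recorder implementation: the function name, the mono 16-bit sample layout, and the transcript schema are assumptions, and frame output is left out.

```python
import json
import wave
from pathlib import Path


def write_session_archive(out_dir, pcm16k: bytes, turns: list) -> None:
    """Write audio.wav and transcript.json for one finished session.

    Assumes pcm16k is mono, 16-bit little-endian PCM at 16 kHz and that
    `turns` is a JSON-serialisable list (hypothetical transcript schema).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # WAV header fields must match the buffered PCM exactly.
    with wave.open(str(out / "audio.wav"), "wb") as wav:
        wav.setnchannels(1)      # mono (assumption)
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(16000)  # 16 kHz, as in the voice loop
        wav.writeframes(pcm16k)

    (out / "transcript.json").write_text(
        json.dumps(turns, indent=2), encoding="utf-8"
    )
```

Buffering until the end of the session keeps the write path out of the real-time loop, at the cost of holding the whole session in memory.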
Streaming graph contract, observability, and cost discipline come standard. The agent ships with a full test suite that runs in CI on every node version bump.
aiortc 1.13.0 + PyAV 15.0 in the worker image. The browser handles camera/mic permissions; SDP and ICE ride the streaming WebSocket; media flows over DTLS+SRTP/UDP. The pod needs UDP exposure (LoadBalancer or per-tenant TURN).
Inbound Opus at 48 kHz is ratecv-downsampled to 16 kHz PCM AudioFrames. The VAD, STT, turn-detection, memory, LLM, and TTS nodes are the same ones the WebSocket-only voice templates use; WebRTC only changes the transport at the edge.
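To make the 48 kHz to 16 kHz step concrete, here is a rough pure-Python stand-in. It does not call ratecv; it averages each group of three samples, which is a crude 3:1 decimation. Mono 16-bit little-endian PCM and the function name are assumptions.

```python
import struct


def downsample_48k_to_16k(pcm48: bytes) -> bytes:
    """Decimate mono 16-bit little-endian PCM from 48 kHz to 16 kHz.

    Each output sample is the mean of three consecutive input samples.
    A production resampler would use a proper low-pass filter instead.
    """
    n = len(pcm48) // 2
    samples = struct.unpack(f"<{n}h", pcm48[: n * 2])
    usable = n - n % 3  # drop a trailing partial group, if any
    out = [
        (samples[i] + samples[i + 1] + samples[i + 2]) // 3
        for i in range(0, usable, 3)
    ]
    return struct.pack(f"<{len(out)}h", *out)
```

A 20 ms frame at 48 kHz holds 960 samples, so the function returns 320 samples (640 bytes) for it.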
ImageFrames buffer in the recorder; on EndFrame each is encoded to JPEG (Pillow) and written to frames/
LLM response feeds back into conversation-memory as the assistant turn (ADR-S16 feedback edge with max_iterations: 1000). The graph_validator accepts the cycle.
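One way a validator could accept such a cycle is to require that every cycle passes through an edge explicitly marked as feedback with a positive iteration bound. The sketch below illustrates that idea only; the edge schema (src, dst, feedback, max_iterations) and the function name are hypothetical and are not taken from ADR-S16 or graph_validator.

```python
from collections import defaultdict, deque


def validate_graph(nodes, edges) -> bool:
    """Accept a graph whose only cycles go through bounded feedback edges."""
    forward = []
    for edge in edges:
        if edge.get("feedback"):
            bound = edge.get("max_iterations")
            if not isinstance(bound, int) or bound < 1:
                raise ValueError("feedback edge needs a positive max_iterations")
        else:
            forward.append(edge)

    # Kahn's algorithm on the forward edges: with feedback edges removed,
    # the remaining graph must be acyclic.
    indegree = {node: 0 for node in nodes}
    successors = defaultdict(list)
    for edge in forward:
        successors[edge["src"]].append(edge["dst"])
        indegree[edge["dst"]] += 1

    queue = deque(node for node, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    if visited != len(nodes):
        raise ValueError("cycle without a bounded feedback edge")
    return True
```

Under this scheme an LLM-to-memory edge with max_iterations: 1000 passes, while the same edge without the feedback flag is rejected as an unbounded cycle.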
video_sample_hz config dials between 0 (audio-only, no ImageFrames) and 5+ fps (denser archive). Default 1 fps balances archival fidelity against egress bandwidth and storage cost.
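As an illustration of how a sample rate like this can gate decoded frames, the sketch below keeps a frame only when its presentation timestamp reaches the next due time. The class name and the timestamp-in-seconds input are assumptions; the trigger's actual sampling logic may differ.

```python
class FrameSampler:
    """Keep at most `sample_hz` frames per second of media time."""

    def __init__(self, sample_hz: float):
        # A rate of 0 disables video: no frame is ever kept.
        self.interval = 1.0 / sample_hz if sample_hz > 0 else None
        self.next_due = 0.0

    def should_keep(self, pts_seconds: float) -> bool:
        if self.interval is None:
            return False
        if pts_seconds < self.next_due:
            return False
        # Schedule the next kept frame one interval after this one.
        self.next_due = pts_seconds + self.interval
        return True


# Three seconds of 30 fps video sampled at the default 1 fps.
sampler = FrameSampler(1.0)
kept = [k for k in range(90) if sampler.should_keep(k / 30)]
```

Here `kept` holds the frame indices 0, 30, and 60, so a 30 fps camera stream contributes one archived frame per second.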
ICE handshake state, codec selection, audio/video frame counts, JPEG bytes, recorder writes - all tagged on the call's span tree per ADR-S14. Replay any session, any turn, any time.
Every dependency is a registered Strongly service or a model you control. Swap any one of them in the install wizard. The graph stays intact.
The marketplace template is the graph. Every customisation below is a config change or a single-node addition - never a rewrite.
rtc_in.config.video_sample_hz default 1.0. Lower to 0.2 for low-bandwidth sessions, raise to 5.0 for denser archive, set 0 to disable video entirely (the trigger becomes audio-only).
The default ice_servers entry is the public Google STUN server. Most production deployments need a TURN relay because the streaming worker pod doesn't have a public IP. Provision per-tenant TURN credentials and inject them into rtc_in.config.ice_servers at workflow instantiation.
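A sketch of building that list at instantiation time follows. The entry shape (urls, username, credential) follows the W3C RTCIceServer dictionary; whether rtc_in.config.ice_servers accepts exactly this shape is an assumption, as are the helper name, the STUN URL standing in for the default, and the example TURN host.

```python
def build_ice_servers(turn_url=None, username=None, credential=None):
    """Return an ICE server list: public STUN plus an optional TURN relay."""
    # Assumed default: Google's public STUN endpoint.
    servers = [{"urls": ["stun:stun.l.google.com:19302"]}]
    if turn_url:
        if not (username and credential):
            raise ValueError("TURN relay requires username and credential")
        servers.append(
            {"urls": [turn_url], "username": username, "credential": credential}
        )
    return servers


# Hypothetical per-tenant relay; replace with provisioned values.
ice_servers = build_ice_servers(
    turn_url="turn:turn.example.com:3478",
    username="tenant-123",
    credential="short-lived-secret",
)
```

Keeping the STUN entry alongside TURN lets clients with a direct path avoid the relay, so the relay's bandwidth is spent only on sessions that need it.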
Edit memory.config.system_prompt to set the agent's voice, refusal policy, and disclosure language for your product.
recorder.config.record_audio / record_transcript / record_video each toggle independently. Disable any to keep the archive narrow.
Pick a different ai_models.tts in the install wizard. The graph stays the same.
We don't leave until it runs. Talk to a forward-deployed engineer about deploying Multimodal Vision Q&A into your environment with your STT, your LLM, your TTS, your data.