Strongly Certified · Streaming Workflow

Multimodal Vision Q&A

Browser WebRTC voice + video in. Audio loop on the wire, video archived to S3.

Browser-side WebRTC carries audio and camera frames to the agent over DTLS+SRTP. The standard streaming voice loop runs on the audio; ImageFrames are sampled at 1 fps by default into the recorder, which writes per-frame JPEGs alongside audio.wav and transcript.json at session end.

≤1.5s
First audio response (p95)
1 fps
Default video sample rate
WebRTC
DTLS + SRTP over UDP

The voice loop, end-to-end.

No black box. Each step is a typed-frame node you can edit, monitor, and replace.

01

Browser opens an RTCPeerConnection and sends an SDP offer over the same session WebSocket the runtime already accepts.

02

streaming-webrtc-trigger answers, negotiates ICE, and decodes inbound media. Opus audio is decoded and ratecv-downsampled to 16 kHz PCM AudioFrames; VP8, VP9, or H.264 video is decoded to RGB24 ImageFrames at the configured video_sample_hz.

03

Standard voice loop: VAD, STT, turn-detection, conversation-memory, LLM, TTS. The agent's audio reply rides back over the streaming WebSocket as PCM 16 kHz.

04

streaming-recorder buffers audio, transcript, and ImageFrames; on EndFrame writes audio.wav, transcript.json, frames/.jpg per video frame, and frames.json + manifest.json to the S3 archive.
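As a rough illustration of how step 02 might thin decoded video down to video_sample_hz, here is a minimal sketch of a timestamp-based sampler. The class name FrameSampler and its accept method are hypothetical; the trigger's actual sampling logic is not shown in this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FrameSampler:
    """Keeps at most `video_sample_hz` frames per second of media time (hypothetical helper)."""

    video_sample_hz: float
    _next_due: Optional[float] = field(default=None, init=False)

    def accept(self, pts_seconds: float) -> bool:
        """Return True if the frame with this presentation time should be kept."""
        if self.video_sample_hz <= 0:
            return False  # 0 means audio-only: no frame is ever kept
        if self._next_due is None:
            self._next_due = pts_seconds  # the first frame starts the schedule
        if pts_seconds < self._next_due:
            return False
        interval = 1.0 / self.video_sample_hz
        # Advance along the schedule rather than from this frame,
        # so arrival jitter does not accumulate into drift.
        while self._next_due <= pts_seconds:
            self._next_due += interval
        return True


# Example: a 30 fps camera for three seconds, sampled at 1 Hz.
sampler = FrameSampler(video_sample_hz=1.0)
kept = [i for i in range(90) if sampler.accept(i / 30)]
```

Keying the decision on presentation timestamps instead of counting frames keeps the output rate stable when the browser changes its capture frame rate mid-session.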

Built for production. Day Two-ready.

Streaming graph contract, observability, and cost discipline come standard. The agent ships with a full test suite that runs in CI on every node version bump.

Browser WebRTC media plane

aiortc 1.13.0 + PyAV 15.0 in the worker image. The browser handles camera/mic permissions; SDP and ICE ride the streaming WebSocket; media flows over DTLS+SRTP/UDP. The pod needs UDP exposure (LoadBalancer or per-tenant TURN).

aiortc 1.13 · PyAV 15 · TURN-ready

Audio loop on the wire

Inbound Opus at 48 kHz is ratecv-downsampled to 16 kHz PCM AudioFrames. The loop uses the same VAD, STT, turn-detection, memory, LLM, and TTS nodes as the WebSocket-only voice templates - WebRTC just changes the transport at the edge.

48 kHz -> 16 kHz · Phase-continuous · Same voice loop
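To make "phase-continuous" concrete, here is a small sketch of a streaming 48 kHz to 16 kHz converter. It does not reproduce the template's ratecv step; it substitutes a plain average-of-three decimator (a crude box filter) and assumes mono 16-bit PCM. The class name Decimator3 is hypothetical.

```python
from array import array


class Decimator3:
    """Streaming 48 kHz -> 16 kHz converter for mono 16-bit PCM (illustrative only).

    Each output sample is the average of three input samples. Samples left
    over at the end of a chunk are carried into the next call, so chunk
    boundaries never shift the output phase.
    """

    def __init__(self) -> None:
        self._carry = array("h")

    def process(self, chunk: array) -> array:
        samples = self._carry + chunk
        usable = len(samples) - len(samples) % 3
        out = array(
            "h",
            (
                (samples[i] + samples[i + 1] + samples[i + 2]) // 3
                for i in range(0, usable, 3)
            ),
        )
        self._carry = samples[usable:]  # 0, 1, or 2 samples wait for the next chunk
        return out


# Example: converting in one call and in uneven chunks gives the same samples.
pcm = array("h", range(30))
one_shot = Decimator3().process(pcm)
dec = Decimator3()
chunked = dec.process(pcm[:7]) + dec.process(pcm[7:20]) + dec.process(pcm[20:])
```

The carried remainder is what keeps the stream continuous: without it, every packet whose length is not a multiple of three would drop or duplicate a fraction of an output sample.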

Per-frame JPEG archive

ImageFrames buffer in the recorder; on EndFrame each is encoded to JPEG (Pillow) and written to frames/.jpg with an indexed frames.json. The manifest carries video_frame_count and video_bytes_total so any session is replayable end-to-end.

Pillow · Per-frame JPEG · Indexed manifest
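One way the index and manifest could be assembled is sketched below. Only video_frame_count and video_bytes_total come from the description above; the function name, the per-frame fields (index, key, bytes), and the object keys used in the example are assumptions for illustration, and JPEG encoding itself is left out.

```python
import json
from typing import Iterable, Tuple


def build_archive_index(frames: Iterable[Tuple[str, bytes]]) -> Tuple[str, str]:
    """Return (frames_json, manifest_json) for already-encoded JPEG frames.

    `frames` yields (object_key, jpeg_bytes) pairs in capture order.
    The per-frame schema here is hypothetical.
    """
    entries = []
    total_bytes = 0
    for index, (key, jpeg) in enumerate(frames):
        entries.append({"index": index, "key": key, "bytes": len(jpeg)})
        total_bytes += len(jpeg)

    frames_json = json.dumps({"frames": entries}, indent=2)
    manifest_json = json.dumps(
        {"video_frame_count": len(entries), "video_bytes_total": total_bytes},
        indent=2,
    )
    return frames_json, manifest_json


# Example with two placeholder payloads standing in for real JPEG data.
frames_json, manifest_json = build_archive_index(
    [("frames/a.jpg", b"\xff\xd8abc"), ("frames/b.jpg", b"\xff\xd8")]
)
manifest = json.loads(manifest_json)
```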

Memory feedback edge

LLM response feeds back into conversation-memory as the assistant turn (ADR-S16 feedback edge with max_iterations: 1000). The graph_validator accepts the cycle.

ADR-S16 · Coherent memory · Validator-clean
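The idea of a validator that accepts a cycle only when a bounded feedback edge closes it can be sketched as follows. This is not the graph_validator's real code or schema: the Edge type, the validate function, and the rule that a non-null max_iterations marks a feedback edge are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    max_iterations: Optional[int] = None  # set only on feedback edges (assumed convention)


def validate(edges: List[Edge]) -> None:
    """Raise ValueError if a cycle exists that no bounded feedback edge breaks."""
    forward: Dict[str, List[str]] = {}
    for edge in edges:
        if edge.max_iterations is not None:
            if edge.max_iterations <= 0:
                raise ValueError(f"feedback edge {edge.src}->{edge.dst} needs a positive bound")
            continue  # bounded feedback edges are excluded from the cycle check
        forward.setdefault(edge.src, []).append(edge.dst)
        forward.setdefault(edge.dst, [])

    state: Dict[str, int] = {}  # 1 = on the DFS stack, 2 = fully explored

    def visit(node: str) -> None:
        state[node] = 1
        for nxt in forward[node]:
            if state.get(nxt) == 1:
                raise ValueError(f"unbounded cycle through {nxt}")
            if nxt not in state:
                visit(nxt)
        state[node] = 2

    for node in list(forward):
        if node not in state:
            visit(node)


# Example: the LLM -> memory edge is accepted only when it carries a bound.
accepted = [Edge("memory", "llm"), Edge("llm", "tts"), Edge("llm", "memory", max_iterations=1000)]
rejected = [Edge("memory", "llm"), Edge("llm", "tts"), Edge("llm", "memory")]
validate(accepted)
```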

Sample-rate control

video_sample_hz config dials between 0 (audio-only, no ImageFrames) and 5+ fps (denser archive). Default 1 fps balances archival fidelity against egress bandwidth and storage cost.

0-5+ fps · Bandwidth knob · Cost-tunable
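The storage side of that trade-off is simple arithmetic. The sketch below estimates frame count and archive size for a session; the function name is hypothetical, and the 50 KB average JPEG size in the example is an assumed figure, not a measured one.

```python
from typing import Tuple


def estimate_video_archive(
    duration_s: float, video_sample_hz: float, avg_jpeg_bytes: int
) -> Tuple[int, int]:
    """Return (frame_count, total_bytes) for one session's JPEG archive."""
    frame_count = int(duration_s * video_sample_hz)
    return frame_count, frame_count * avg_jpeg_bytes


# Example: a 10-minute session at the default 1 fps,
# assuming roughly 50 KB per JPEG (an illustrative guess).
frames, total_bytes = estimate_video_archive(600, 1.0, 50_000)
```

Under those assumptions a 10-minute session stores 600 frames, about 30 MB; since size scales linearly with video_sample_hz, 5 fps would come to roughly 150 MB for the same session.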

Live span tree

ICE handshake state, codec selection, audio/video frame counts, JPEG bytes, recorder writes - all tagged on the call's span tree per ADR-S14. Replay any session, any turn, any time.

ADR-S14 · Codec spans · Replayable

Real services. Your stack.

Every dependency is a registered Strongly service or a model you control. Swap any one of them in the install wizard. The graph stays intact.

WebRTC trigger
streaming-webrtc-trigger - aiortc 1.13.0 + PyAV 15.0
Recorder (v1.2.0)
streaming-recorder - audio.wav + transcript.json + per-frame JPEGs + frames.json
STT / LLM / TTS
All swappable in the wizard - whisper-1 / gpt-4o-mini / tts-1 default
S3 archive
Resolved from STRONGLY_SERVICES via the s3 alias - real S3, R2, MinIO, etc.

Tune it. Don't fork it.

The marketplace template is the graph. Every customisation below is a config change or a single-node addition - never a rewrite.

Video sample rate

rtc_in.config.video_sample_hz defaults to 1.0. Lower it to 0.2 for low-bandwidth sessions, raise it to 5.0 for a denser archive, or set it to 0 to disable video entirely (the trigger becomes audio-only).

TURN servers in production

The default ice_servers entry is Google's public STUN server. Most production deployments need a TURN relay because the streaming worker pod doesn't have a public IP. Provision per-tenant TURN credentials and inject them into rtc_in.config.ice_servers at workflow instantiation.
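A minimal sketch of building that list at instantiation time follows. It assumes rtc_in.config.ice_servers accepts entries shaped like the W3C RTCIceServer dictionary (urls, username, credential), which this page does not confirm. The function name, the STUN URL constant, and the example TURN host and credentials are placeholders.

```python
from typing import Dict, List

# Commonly used public Google STUN endpoint; assumed to match the template default.
DEFAULT_STUN = "stun:stun.l.google.com:19302"


def build_ice_servers(turn_url: str, username: str, credential: str) -> List[Dict[str, object]]:
    """Return an ice_servers list with STUN first and a per-tenant TURN relay second."""
    return [
        {"urls": [DEFAULT_STUN]},
        {"urls": [turn_url], "username": username, "credential": credential},
    ]


# Example with placeholder per-tenant values.
ice_servers = build_ice_servers("turn:turn.example.com:3478", "tenant-a", "secret")
```

Keeping the STUN entry alongside TURN lets clients that can connect directly avoid the relay, so relay bandwidth is spent only on sessions that need it.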

Different system prompt

Edit memory.config.system_prompt to set the agent's voice, refusal policy, and disclosure language for your product.

Selective archiving

recorder.config.record_audio / record_transcript / record_video each toggle independently. Disable any to keep the archive narrow.

Voice cloning / branded TTS

Pick a different ai_models.tts in the install wizard. The graph stays the same.

Production. Not pilots.

We don't leave until it runs. Talk to a forward-deployed engineer about deploying Multimodal Vision Q&A into your environment with your STT, your LLM, your TTS, your data.

Schedule a Demo