Twilio Media Streams in / out, realtime model behind the carrier.
mulaw 8 kHz on the wire end-to-end: the realtime model emits g711_mulaw 8 kHz directly, so the response side passes audio through without re-encoding. Single-node alternative to STT + LLM + TTS for telephony.
No black box. Each step is a typed-frame node you can edit, monitor, and replace.
Twilio answers the inbound call (or your dialler places an outbound one). The call's TwiML (a <Connect><Stream> verb) tells Twilio to open a Media Streams WebSocket to this workflow's session.
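A minimal sketch of that TwiML, built with Python's standard library. It assumes the workflow exposes a session WebSocket URL (the one below is a hypothetical placeholder). `<Connect><Stream>` is the Twilio verb for a bidirectional stream, which the response side needs in order to send audio back on the same socket.

```python
import xml.etree.ElementTree as ET

def stream_twiml(ws_url: str) -> str:
    """Build TwiML that connects the call's audio to a bidirectional Media Stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # <Connect><Stream> (rather than <Start><Stream>) lets audio flow back to the caller.
    ET.SubElement(connect, "Stream", url=ws_url)
    return '<?xml version="1.0" encoding="UTF-8"?>' + ET.tostring(response, encoding="unicode")

# Hypothetical session URL; substitute the WebSocket URL the workflow actually exposes.
print(stream_twiml("wss://example.invalid/sessions/abc123/twilio"))
```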
streaming-twilio-trigger 1.1.0 receives mulaw 8 kHz envelopes and forwards 8 kHz PCM AudioFrames; streaming-realtime-agent is configured with input_audio_format=g711_mulaw and output_audio_format=g711_mulaw, so it emits mulaw directly.
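The trigger's decode step can be sketched as follows, assuming Twilio's standard `media` message (base64 mu-law bytes in `media.payload`). `ulaw_to_linear` and `media_event_to_pcm` are hypothetical names used for illustration, not the node's code. The decoder is written out by hand because `audioop.ulaw2lin` was removed from the standard library in Python 3.13.

```python
import base64
import json
import struct
from typing import Optional

def ulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                 # mu-law bytes are transmitted bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    # Rebuild the biased magnitude, then remove the 0x84 (132) encoding bias.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

def media_event_to_pcm(message: str) -> Optional[bytes]:
    """Return 16-bit little-endian PCM (still 8 kHz) for a Twilio 'media' message, else None."""
    event = json.loads(message)
    if event.get("event") != "media":
        return None              # 'connected', 'start', 'dtmf', 'stop', ... carry no audio
    ulaw = base64.b64decode(event["media"]["payload"])
    samples = [ulaw_to_linear(b) for b in ulaw]
    # One output sample per input byte: a codec decode, no sample-rate conversion.
    return struct.pack(f"<{len(samples)}h", *samples)
```

Because each mu-law byte maps to exactly one PCM sample, the rate stays at 8 kHz; only the sample width changes from 8 to 16 bits.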
streaming-twilio-response 1.1.0 wraps the model's mulaw audio in Twilio media envelopes and sends them back to the carrier. No PCM ratecv on either side of the wire.
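A sketch of the response-side wrapping, assuming Twilio's outbound `media` message shape for bidirectional streams. At 8 kHz with one byte per sample, mulaw is 8,000 bytes/s (64 kbit/s), so a 20 ms frame is 160 bytes. The 20 ms chunk size is this sketch's assumption, not something the template specifies, and `media_envelopes` is a hypothetical helper.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1                 # G.711 mu-law: one byte per sample
FRAME_MS = 20                        # assumed chunk duration
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 160 bytes

def media_envelopes(stream_sid: str, ulaw_audio: bytes) -> Iterator[str]:
    """Wrap raw mu-law bytes from the model in Twilio outbound 'media' messages."""
    for offset in range(0, len(ulaw_audio), FRAME_BYTES):
        chunk = ulaw_audio[offset:offset + FRAME_BYTES]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            # The model's bytes are base64-encoded as-is: no decode, no resample.
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

Passing the model's bytes through untouched is what "no re-encoding" means on this side of the wire.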
A DTMF- or VAD-driven InterruptFrame from the trigger causes the realtime node to send response.cancel to the model for barge-in. Audio and transcript are archived to S3 on EndFrame.
Streaming graph contract, observability, and cost discipline come standard. The agent ships with a full test suite that runs in CI on every node version bump.
streaming-twilio-trigger + streaming-twilio-response 1.1.0+. A carrier-agnostic envelope_spec config supports Bandwidth Voice and internal SIP bridges; the default config targets Twilio.
Replaces VAD + STT + LLM + TTS with one realtime model. mulaw 8 kHz stays on the wire end-to-end: no PCM ratecv, no extra audioop hops.
When the caller presses the configured interrupt_dtmf_digit, the trigger emits an InterruptFrame; the realtime node sends response.cancel to the provider, and outbound mulaw audio already buffered at the carrier is flushed with a Twilio clear envelope.
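The barge-in path can be sketched as one function, assuming Twilio's `dtmf` message (digit in `dtmf.digit`), the OpenAI-style `response.cancel` client event, and Twilio's `clear` message. In the template these steps are split between the trigger (InterruptFrame) and the realtime node; collapsing them into `on_dtmf` (a hypothetical name) is a simplification for illustration.

```python
import json
from typing import List, Tuple

def on_dtmf(event: dict, interrupt_digit: str, stream_sid: str) -> List[Tuple[str, str]]:
    """Return (destination, message) pairs to send when a Twilio DTMF event arrives.

    An empty interrupt_digit disables DTMF barge-in entirely.
    """
    if event.get("event") != "dtmf" or not interrupt_digit:
        return []
    if event["dtmf"]["digit"] != interrupt_digit:
        return []
    return [
        # Stop the model from generating the rest of the current response.
        ("model", json.dumps({"type": "response.cancel"})),
        # Drop mu-law audio Twilio has queued but not yet played to the caller.
        ("carrier", json.dumps({"event": "clear", "streamSid": stream_sid})),
    ]
```

Both messages are needed: cancelling the response stops new audio from the model, while `clear` discards audio Twilio has already queued for playback.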
The telephony pair handles the carrier protocol, the realtime node handles the model, and the recorder archives the call. Each piece is independently swappable, and the loop shape matches the WebSocket-driven realtime template.
Wire a SteerFrame source into steer_in (e.g. an operator dashboard signal) to swap voice or instructions mid-call without ending the conversation. The realtime node defers voice swaps until response.done, so the carrier never receives a half-rendered chunk.
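One way to implement the deferral, sketched under two assumptions: the realtime node speaks an OpenAI-style protocol (`session.update`, `response.created`, `response.done`), and instruction changes may apply immediately (the template only states that voice swaps are deferred). The `SteerFrame` fields and `SteerController` shown here are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SteerFrame:  # hypothetical shape; the template's real frame type may differ
    voice: Optional[str] = None
    instructions: Optional[str] = None

class SteerController:
    """Apply instruction changes at once (assumption); hold voice swaps until response.done."""

    def __init__(self) -> None:
        self.response_active = False
        self.pending_voice: Optional[str] = None

    def on_provider_event(self, event_type: str) -> List[str]:
        if event_type == "response.created":
            self.response_active = True
        elif event_type == "response.done":
            self.response_active = False
            if self.pending_voice is not None:
                voice, self.pending_voice = self.pending_voice, None
                return [self._session_update(voice=voice)]
        return []

    def on_steer(self, frame: SteerFrame) -> List[str]:
        out = []
        if frame.instructions is not None:
            out.append(self._session_update(instructions=frame.instructions))
        if frame.voice is not None:
            if self.response_active:
                self.pending_voice = frame.voice  # last swap wins; applied at response.done
            else:
                out.append(self._session_update(voice=frame.voice))
        return out

    @staticmethod
    def _session_update(**fields: str) -> str:
        return json.dumps({"type": "session.update", "session": fields})
```

If several voice swaps arrive during one response, only the last one is applied at `response.done`.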
Per-frame spans are written to workflow_spans, and the carrier IDs (stream_sid, call_sid) are tagged on the call's span tree per ADR-S14. Replay any call, any turn, any time.
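The carrier IDs come from Twilio's `start` message, which arrives once per stream before any media. A sketch of extracting them as span attributes. `carrier_span_attributes` and the `media.*` attribute keys are hypothetical, and the span store (workflow_spans) is not modelled here.

```python
import json
from typing import Dict

def carrier_span_attributes(message: str) -> Dict[str, str]:
    """Pull carrier IDs from a Twilio 'start' message for tagging the call's root span."""
    event = json.loads(message)
    if event.get("event") != "start":
        return {}
    start = event["start"]
    fmt = start.get("mediaFormat", {})
    return {
        "stream_sid": start["streamSid"],
        "call_sid": start["callSid"],
        # Record the negotiated codec so a replay knows the audio is 8 kHz mu-law.
        "media.encoding": fmt.get("encoding", ""),
        "media.sample_rate": str(fmt.get("sampleRate", "")),
    }
```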
Every dependency is a registered Strongly service or a model you control. Swap any one of them in the install wizard. The graph stays intact.
The marketplace template is the graph. Every customisation below is a config change or a single-node addition - never a rewrite.
The graph is identical; only Twilio call origination changes. For outbound calls, your dialler creates the call and points its answer TwiML at this workflow's session WebSocket.
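A sketch of outbound origination using only the standard library. It builds a POST to Twilio's Calls resource with inline `Twiml` that opens a bidirectional stream to the session WebSocket. The function name, phone numbers, and URL are placeholders. The request is built but not sent; pass it to `urllib.request.urlopen` to place the call.

```python
import base64
import urllib.parse
import urllib.request
from xml.sax.saxutils import quoteattr

def outbound_call_request(account_sid: str, auth_token: str, to: str, from_: str,
                          ws_url: str) -> urllib.request.Request:
    """Build (but do not send) a Twilio Calls API request whose TwiML streams to ws_url."""
    # quoteattr escapes the URL and wraps it in quotes for use as an XML attribute.
    twiml = f"<Response><Connect><Stream url={quoteattr(ws_url)}/></Connect></Response>"
    body = urllib.parse.urlencode({"To": to, "From": from_, "Twiml": twiml}).encode()
    url = f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Calls.json"
    req = urllib.request.Request(url, data=body, method="POST")
    # Twilio's REST API uses HTTP Basic auth with AccountSid:AuthToken.
    creds = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    req.add_header("Authorization", f"Basic {creds}")
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    return req
```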
Edit tel_in.config.interrupt_dtmf_digit. An empty value disables DTMF barge-in; VAD-driven barge-in still works.
rt.config.voice sets the initial voice; SteerFrame mid-call swaps voices (deferred until response.done).
rt.config.system_prompt sets the agent's speaking style, tone, and refusal policy; use SteerFrame.instructions for runtime changes.
Set tel_in.config.envelope_spec + tel_out.config.response_envelope_spec to the carrier's shape; codec stays mulaw 8 kHz on the wire if the carrier is G.711.
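The envelope_spec schema isn't documented here, so the sketch below invents one for illustration: a few dotted paths telling a generic parser where the event type, stream ID, and base64 payload live. `TWILIO_SPEC`, `SIP_BRIDGE_SPEC`, and `parse_inbound` are hypothetical; `SIP_BRIDGE_SPEC` describes a made-up bridge, not Bandwidth's real message format.

```python
import base64
import json
from typing import Any, Dict, Optional, Tuple

# Hypothetical spec shape; the node's real envelope_spec schema may differ.
TWILIO_SPEC = {"event_path": "event", "media_event": "media",
               "payload_path": "media.payload", "stream_id_path": "streamSid"}
SIP_BRIDGE_SPEC = {"event_path": "type", "media_event": "audio",
                   "payload_path": "data.audio", "stream_id_path": "data.session"}

def get_path(obj: Any, dotted: str) -> Any:
    """Follow a dotted key path through nested dicts; None if any key is missing."""
    for key in dotted.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

def parse_inbound(message: str, spec: Dict[str, str]) -> Optional[Tuple[str, bytes]]:
    """Return (stream_id, mu-law bytes) for a media envelope described by spec, else None."""
    env = json.loads(message)
    if get_path(env, spec["event_path"]) != spec["media_event"]:
        return None
    # The G.711 payload is passed through untouched; only the envelope differs by carrier.
    payload = base64.b64decode(get_path(env, spec["payload_path"]))
    return get_path(env, spec["stream_id_path"]), payload
```

The mu-law payload comes out unchanged, consistent with the codec staying mulaw 8 kHz on the wire when the carrier is G.711.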
We don't leave until it runs. Talk to a forward-deployed engineer about deploying the Realtime Telephony Agent into your environment with your carrier, your models, your data.