Real-Time Streaming with Amazon Nova Sonic: Architecture Deep Dive
MDFit Nova-Sonic handles live phone calls, so every millisecond of latency is perceptible to the caller. Here is how we architected the system for real-time streaming.
The Latency Budget
For natural conversation, the total round trip — from the moment a caller stops speaking to the moment the AI's reply begins — must stay under 1 second. That budget breaks down as:
- Speech-to-text: ~200ms
- Intent classification + routing: ~100ms
- LLM generation: ~400ms
- Text-to-speech: ~200ms
Exceed 1.2 seconds and callers perceive the AI as slow. Exceed 2 seconds and they hang up.
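The budget and the perception thresholds above can be sanity-checked in a few lines. This is an illustrative sketch, not production code; the stage names and band labels are ours, but the numbers come straight from the breakdown:

```python
# Per-stage latency budget from the breakdown above, in milliseconds.
BUDGET_MS = {
    "speech_to_text": 200,
    "intent_routing": 100,
    "llm_generation": 400,
    "text_to_speech": 200,
}

def classify_latency(total_ms: float) -> str:
    """Map a measured round-trip time onto the caller-perception bands."""
    if total_ms <= 1000:
        return "natural"
    if total_ms <= 1200:
        return "acceptable"
    if total_ms <= 2000:
        return "slow"       # callers perceive the AI as slow
    return "abandoned"      # callers hang up

# The stages sum to 900ms, leaving roughly 100ms of headroom
# for network transit between hops.
total_budget = sum(BUDGET_MS.values())
```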
Bidirectional WebSocket Architecture
Amazon Nova Sonic supports bidirectional streaming over WebSockets. This is critical because it allows:
Simultaneous input and output
The system can process incoming audio while generating a response. This enables natural turn-taking and even mid-sentence interruption handling.
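Full-duplex processing can be sketched as two concurrent tasks: one consuming caller audio while the other emits response audio. In this minimal sketch, `asyncio.Queue` stands in for the WebSocket, and the names (`inbound`, `outbound`, the stub payloads) are illustrative:

```python
import asyncio

async def read_caller_audio(inbound: asyncio.Queue, heard: list) -> None:
    """Consume incoming audio chunks; real code would forward them to STT."""
    while True:
        chunk = await inbound.get()
        if chunk is None:          # end-of-stream sentinel
            break
        heard.append(chunk)

async def speak_response(outbound: asyncio.Queue, words: list) -> None:
    """Emit response audio; real code would forward TTS output frames."""
    for word in words:
        await outbound.put(word)
    await outbound.put(None)

async def duplex_demo() -> tuple:
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    heard: list = []
    # Seed some "caller audio", then run both directions concurrently.
    for chunk in ("hi", "there", None):
        inbound.put_nowait(chunk)
    await asyncio.gather(
        read_caller_audio(inbound, heard),
        speak_response(outbound, ["hello", "caller"]),
    )
    spoken = []
    while not outbound.empty():
        item = outbound.get_nowait()
        if item is not None:
            spoken.append(item)
    return heard, spoken
```

Because neither direction blocks the other, the same structure supports turn-taking and interruption handling.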
Streaming token generation
Instead of waiting for a complete response, tokens stream to the TTS engine as they are generated. The first syllable of the response begins playing while the rest is still being generated.
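The streaming handoff can be sketched as a generator that flushes buffered text to the TTS engine as soon as enough has accumulated, rather than waiting for the complete response. The token source, flush threshold, and fragment boundaries here are illustrative:

```python
from typing import Iterator

def llm_tokens() -> Iterator[str]:
    # Stand-in for a streaming LLM response; real code would iterate
    # over the model's token stream as tokens arrive.
    yield from ["Your ", "appointment ", "is ", "confirmed."]

def stream_to_tts(tokens: Iterator[str], min_chars: int = 12) -> Iterator[str]:
    """Yield speakable fragments incrementally. The first fragment can
    start playing while the rest of the response is still generating."""
    buffer = ""
    for token in tokens:
        buffer += token
        if len(buffer) >= min_chars:
            yield buffer   # hand this fragment to TTS immediately
            buffer = ""
    if buffer:
        yield buffer       # flush whatever remains at end of stream
```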
Barge-in detection
If a caller interrupts the AI mid-response, the system detects the barge-in, stops the current response, and processes the new input.
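The core of barge-in handling is a small state machine: when caller speech is detected while the AI is speaking, cancel playback and flip back to listening. This sketch uses a simple energy threshold as an illustrative stand-in for real voice-activity detection:

```python
from enum import Enum, auto

class CallState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeInHandler:
    """Minimal barge-in sketch; the threshold value is a placeholder,
    not a tuned parameter."""

    def __init__(self, speech_threshold: float = 0.3):
        self.state = CallState.LISTENING
        self.speech_threshold = speech_threshold
        self.cancelled_responses = 0

    def start_response(self) -> None:
        """Called when the AI begins speaking."""
        self.state = CallState.SPEAKING

    def on_caller_audio(self, energy: float) -> bool:
        """Return True if this audio frame triggered a barge-in."""
        if self.state is CallState.SPEAKING and energy >= self.speech_threshold:
            self.cancelled_responses += 1  # real code: abort TTS playback here
            self.state = CallState.LISTENING
            return True
        return False
```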
The Twilio Integration
Twilio Media Streams provides the raw audio bridge between the phone network and our WebSocket server.
Caller → Twilio → Media Stream → WebSocket → Nova Sonic → WebSocket → Media Stream → Twilio → Caller
Each hop adds latency, so we colocate our WebSocket server in the same AWS region as the Nova Sonic endpoint.
Optimization Techniques
Connection pooling
Nova Sonic WebSocket connections are expensive to establish. We maintain a warm pool of connections that are reused across calls.
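A warm pool can be sketched in a few lines. Here `connect` stands in for establishing a Nova Sonic WebSocket session, and the pool size is illustrative; a production pool would also handle health checks and stale-connection eviction:

```python
import queue

class WarmPool:
    """Pre-establish connections at startup and reuse them across calls,
    so no caller pays the connection-setup cost."""

    def __init__(self, connect, size: int = 4):
        self._connect = connect
        self._pool: queue.Queue = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())   # pre-warm at startup

    def acquire(self):
        try:
            return self._pool.get_nowait()
        except queue.Empty:
            return self._connect()      # pool exhausted: pay the cold cost

    def release(self, conn) -> None:
        self._pool.put(conn)            # return the connection for reuse
```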
Audio chunking
We send audio in 20ms chunks — small enough for responsive processing, large enough to avoid excessive network overhead.
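The arithmetic behind the 20ms choice is simple. Assuming 8 kHz mu-law telephone audio at 1 byte per sample (the format Twilio Media Streams carries), each chunk works out to 160 bytes, at 50 chunks per second per direction:

```python
def chunk_bytes(duration_ms: int, sample_rate_hz: int = 8000,
                bytes_per_sample: int = 1) -> int:
    """Payload size of one audio chunk. Defaults assume 8 kHz mu-law
    telephone audio (1 byte per sample)."""
    return sample_rate_hz * duration_ms // 1000 * bytes_per_sample

def chunks_per_second(duration_ms: int) -> int:
    """Framing-overhead messages per second for a given chunk size."""
    return 1000 // duration_ms
```

Smaller chunks would cut buffering delay but multiply per-message framing overhead; larger chunks would do the reverse.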
Response prefetching
For common intents (greeting, hold, transfer), we pre-generate responses and cache the audio. This drops response time to under 200ms for predictable interactions.
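The prefetch path reduces to a cache lookup in front of the full pipeline. In this sketch the intent names, canned phrasings, and `synthesize` stub are illustrative; real code would render the cached audio with the actual TTS engine at startup:

```python
def synthesize(text: str) -> bytes:
    return text.encode("utf-8")   # stand-in for a real TTS call

# Pre-generate audio for predictable interactions at startup.
CANNED_RESPONSES = {
    "greeting": "Thanks for calling, how can I help?",
    "hold": "One moment please.",
    "transfer": "Transferring you now.",
}
AUDIO_CACHE = {intent: synthesize(text) for intent, text in CANNED_RESPONSES.items()}

def respond(intent: str, generate_audio) -> tuple:
    """Return (audio, cache_hit). A hit skips LLM generation and TTS
    entirely; a miss falls through to the full pipeline."""
    if intent in AUDIO_CACHE:
        return AUDIO_CACHE[intent], True
    return generate_audio(intent), False
```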
Results
With these optimizations, our median response time is 780ms and our p95 is 1.1 seconds. Callers consistently rate the conversation flow as natural.