Real-Time Streaming with Amazon Nova Sonic: Architecture Deep Dive
MDFit Nova-Sonic handles live phone calls, so every millisecond of latency is perceptible to the caller. Here is how we architected the system for real-time streaming.
The Latency Budget
For natural conversation, the total round trip — from the moment a caller stops speaking to the moment the AI's reply begins — must stay under 1 second. That budget breaks down as:
- Speech-to-text: ~200ms
- Intent classification + routing: ~100ms
- LLM generation: ~400ms
- Text-to-speech: ~200ms
Exceed 1.2 seconds and callers perceive the AI as slow. Exceed 2 seconds and they hang up.
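The budget and the perception thresholds above can be sanity-checked in a few lines. This is an illustrative sketch, not production code; the stage names and band labels are ours, but the numbers come straight from the breakdown:

```python
# Per-stage latency budget from the breakdown above, in milliseconds.
BUDGET_MS = {
    "speech_to_text": 200,
    "intent_routing": 100,
    "llm_generation": 400,
    "text_to_speech": 200,
}

def classify_latency(total_ms: float) -> str:
    """Map a measured round-trip time onto the caller-perception bands."""
    if total_ms <= 1000:
        return "natural"
    if total_ms <= 1200:
        return "acceptable"
    if total_ms <= 2000:
        return "slow"       # callers perceive the AI as slow
    return "abandoned"      # callers hang up

# The stages sum to 900ms, leaving roughly 100ms of headroom
# for network transit between hops.
total_budget = sum(BUDGET_MS.values())
```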
Bidirectional WebSocket Architecture
Amazon Nova Sonic supports bidirectional streaming over WebSockets. This is critical because it allows:
Simultaneous input and output
The system can process incoming audio while generating a response. This enables natural turn-taking and even mid-sentence interruption handling.
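Full-duplex processing can be sketched as two concurrent tasks: one consuming caller audio while the other emits response audio. In this minimal sketch, `asyncio.Queue` stands in for the WebSocket, and the names (`inbound`, `outbound`, the stub payloads) are illustrative:

```python
import asyncio

async def read_caller_audio(inbound: asyncio.Queue, heard: list) -> None:
    """Consume incoming audio chunks; real code would forward them to STT."""
    while True:
        chunk = await inbound.get()
        if chunk is None:          # end-of-stream sentinel
            break
        heard.append(chunk)

async def speak_response(outbound: asyncio.Queue, words: list) -> None:
    """Emit response audio; real code would forward TTS output frames."""
    for word in words:
        await outbound.put(word)
    await outbound.put(None)

async def duplex_demo() -> tuple:
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    heard: list = []
    # Seed some "caller audio", then run both directions concurrently.
    for chunk in ("hi", "there", None):
        inbound.put_nowait(chunk)
    await asyncio.gather(
        read_caller_audio(inbound, heard),
        speak_response(outbound, ["hello", "caller"]),
    )
    spoken = []
    while not outbound.empty():
        item = outbound.get_nowait()
        if item is not None:
            spoken.append(item)
    return heard, spoken
```

Because neither direction blocks the other, the same structure supports turn-taking and interruption handling.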
Streaming token generation
Instead of waiting for a complete response, tokens stream to the TTS engine as they are generated. The first syllable of the response begins playing while the rest is still being generated.
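The streaming handoff can be sketched as a generator that flushes buffered text to the TTS engine as soon as enough has accumulated, rather than waiting for the complete response. The token source, flush threshold, and fragment boundaries here are illustrative:

```python
from typing import Iterator

def llm_tokens() -> Iterator[str]:
    # Stand-in for a streaming LLM response; real code would iterate
    # over the model's token stream as tokens arrive.
    yield from ["Your ", "appointment ", "is ", "confirmed."]

def stream_to_tts(tokens: Iterator[str], min_chars: int = 12) -> Iterator[str]:
    """Yield speakable fragments incrementally. The first fragment can
    start playing while the rest of the response is still generating."""
    buffer = ""
    for token in tokens:
        buffer += token
        if len(buffer) >= min_chars:
            yield buffer   # hand this fragment to TTS immediately
            buffer = ""
    if buffer:
        yield buffer       # flush whatever remains at end of stream
```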
Barge-in detection
If a caller interrupts the AI mid-response, the system detects the barge-in, stops the current response, and processes the new input.
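The core of barge-in handling is a small state machine: when caller speech is detected while the AI is speaking, cancel playback and flip back to listening. This sketch uses a simple energy threshold as an illustrative stand-in for real voice-activity detection:

```python
from enum import Enum, auto

class CallState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeInHandler:
    """Minimal barge-in sketch; the threshold value is a placeholder,
    not a tuned parameter."""

    def __init__(self, speech_threshold: float = 0.3):
        self.state = CallState.LISTENING
        self.speech_threshold = speech_threshold
        self.cancelled_responses = 0

    def start_response(self) -> None:
        """Called when the AI begins speaking."""
        self.state = CallState.SPEAKING

    def on_caller_audio(self, energy: float) -> bool:
        """Return True if this audio frame triggered a barge-in."""
        if self.state is CallState.SPEAKING and energy >= self.speech_threshold:
            self.cancelled_responses += 1  # real code: abort TTS playback here
            self.state = CallState.LISTENING
            return True
        return False
```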
The Twilio Integration
Twilio Media Streams provides the raw audio bridge between the phone network and our WebSocket server.
Caller → Twilio → Media Stream → WebSocket → Nova Sonic → WebSocket → Media Stream → Twilio → Caller
Each hop adds latency, so we colocate our WebSocket server in the same AWS region as the Nova Sonic endpoint.
Optimization Techniques
Connection pooling
Nova Sonic WebSocket connections are expensive to establish. We maintain a warm pool of connections that are reused across calls.
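A warm pool can be sketched in a few lines. Here `connect` stands in for establishing a Nova Sonic WebSocket session, and the pool size is illustrative; a production pool would also handle health checks and stale-connection eviction:

```python
import queue

class WarmPool:
    """Pre-establish connections at startup and reuse them across calls,
    so no caller pays the connection-setup cost."""

    def __init__(self, connect, size: int = 4):
        self._connect = connect
        self._pool: queue.Queue = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())   # pre-warm at startup

    def acquire(self):
        try:
            return self._pool.get_nowait()
        except queue.Empty:
            return self._connect()      # pool exhausted: pay the cold cost

    def release(self, conn) -> None:
        self._pool.put(conn)            # return the connection for reuse
```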
Audio chunking
We send audio in 20ms chunks — small enough for responsive processing, large enough to avoid excessive network overhead.
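The arithmetic behind the 20ms choice is simple. Assuming 8 kHz mu-law telephone audio at 1 byte per sample (the format Twilio Media Streams carries), each chunk works out to 160 bytes, at 50 chunks per second per direction:

```python
def chunk_bytes(duration_ms: int, sample_rate_hz: int = 8000,
                bytes_per_sample: int = 1) -> int:
    """Payload size of one audio chunk. Defaults assume 8 kHz mu-law
    telephone audio (1 byte per sample)."""
    return sample_rate_hz * duration_ms // 1000 * bytes_per_sample

def chunks_per_second(duration_ms: int) -> int:
    """Framing-overhead messages per second for a given chunk size."""
    return 1000 // duration_ms
```

Smaller chunks would cut buffering delay but multiply per-message framing overhead; larger chunks would do the reverse.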
Response prefetching
For common intents (greeting, hold, transfer), we pre-generate responses and cache the audio. This drops response time to under 200ms for predictable interactions.
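The prefetch path reduces to a cache lookup in front of the full pipeline. In this sketch the intent names, canned phrasings, and `synthesize` stub are illustrative; real code would render the cached audio with the actual TTS engine at startup:

```python
def synthesize(text: str) -> bytes:
    return text.encode("utf-8")   # stand-in for a real TTS call

# Pre-generate audio for predictable interactions at startup.
CANNED_RESPONSES = {
    "greeting": "Thanks for calling, how can I help?",
    "hold": "One moment please.",
    "transfer": "Transferring you now.",
}
AUDIO_CACHE = {intent: synthesize(text) for intent, text in CANNED_RESPONSES.items()}

def respond(intent: str, generate_audio) -> tuple:
    """Return (audio, cache_hit). A hit skips LLM generation and TTS
    entirely; a miss falls through to the full pipeline."""
    if intent in AUDIO_CACHE:
        return AUDIO_CACHE[intent], True
    return generate_audio(intent), False
```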
Results
With these optimizations, our median response time is 780ms and our p95 is 1.1 seconds. Callers consistently rate the conversation flow as natural.