Streaming LLM Responses: Building Real-Time User Experiences
Implement streaming responses that make AI agents feel fast and responsive - covering SSE setup, token-by-token rendering, progress indicators, and handling partial failures gracefully.

TL;DR
- Streaming reduces perceived latency by 60-80% - users see tokens immediately instead of waiting for full responses.
- Server-Sent Events (SSE) is simpler than WebSockets for one-way streaming; use WebSockets only if you need bidirectional.
- Buffer intelligently: show tokens as they arrive, but batch DOM updates to avoid janky rendering.
- Handle partial failures: if streaming stops mid-response, save what you have and offer retry.
Click send, wait 3 seconds, see nothing. Wait another 2 seconds, still nothing. Finally - a wall of text appears. That's the non-streaming experience. Users don't know if the system is working, broken, or just slow.
Streaming responses change this fundamentally. The first token appears within 200-400ms. Users see the response being generated character by character, giving immediate feedback that the system is working. Even though the total time to complete is similar, perceived performance improves dramatically.
This guide covers implementing streaming from LLM APIs through to frontend rendering, with patterns that handle real-world complications like errors, tool calls, and mobile connectivity.
Key takeaways
- Streaming is primarily a UX improvement - total time to full response is often the same or slightly longer.
- Time to First Token (TTFT) is the metric that matters; aim for <500ms.
- Don't render every token individually - batch updates every 50-100ms for smooth animation.
- Plan for stream interruptions: save partial responses, show graceful errors, offer retry.
Why streaming matters
Research from Stanford's HCI lab found that users perceive streaming chat interfaces as 40% faster than non-streaming interfaces, even when total response time is identical (Stanford HCI, 2024). The psychology is simple: visible progress feels faster than invisible waiting.
Latency breakdown
Non-streaming request:

```text
User sends message
→ Network latency (~50ms)
→ LLM processing (2,000-8,000ms)
→ Network latency (~50ms)
→ Render response (~10ms)

Total time to any feedback: 2,100-8,100ms
```

Streaming request:

```text
User sends message
→ Network latency (~50ms)
→ LLM starts generating (200-400ms)
→ First token arrives
→ Tokens stream continuously
→ Final token arrives (2,000-8,000ms total)

Total time to first feedback: 250-450ms
```

The user sees activity almost immediately with streaming. That changes the entire perception of speed.
When streaming helps most
| Scenario | Benefit | Priority |
|---|---|---|
| Long responses (500+ tokens) | High - visible progress during generation | Essential |
| User-facing chat | High - immediate feedback | Essential |
| Slow models (GPT-4, Claude Opus) | High - reduces perceived wait | High |
| Fast models (GPT-4o-mini) | Medium - already fast | Nice to have |
| Background processing | Low - user isn't watching | Skip |
| Batch operations | None - no UI to update | Skip |
"The developer experience improvements we've seen from AI tools are the most significant since IDEs and version control. This is a permanent shift in how software gets built." - Emily Freeman, VP of Developer Relations at AWS
Architecture options
Three main approaches exist for streaming from LLM to browser.
Option 1: Server-Sent Events (SSE)
One-way stream from server to client. Simplest option for most use cases.
```text
[Client] → HTTP POST /api/chat (message)
[Server] ← Opens SSE connection
[Server] → data: {"token": "Hello"}
[Server] → data: {"token": " world"}
[Server] → data: [DONE]
[Server] ← Connection closes
```

Pros:
- Built on HTTP - works with existing infrastructure
- Automatic reconnection in browsers
- Simple to implement
Cons:
- One-way only (client can't send during stream)
- Some corporate firewalls/proxies struggle with it
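The SSE wire format above is simple enough to parse by hand: events are separated by a blank line, and data lines carry a `data: ` prefix. A minimal sketch of that parsing step (the `parseSSEBuffer` name and shape are illustrative, not from any library) that keeps the trailing incomplete event for the next network chunk:

```typescript
// Hypothetical helper: split a raw SSE buffer into complete event payloads
// plus the trailing incomplete remainder to carry over.
function parseSSEBuffer(buffer: string): { events: string[]; rest: string } {
  const parts = buffer.split('\n\n');
  // An event is only complete once its blank-line delimiter has arrived.
  const rest = parts.pop() ?? '';
  const events = parts
    .filter((p) => p.startsWith('data: '))
    .map((p) => p.slice('data: '.length));
  return { events, rest };
}
```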
Option 2: WebSockets
Bidirectional persistent connection. More complex but more flexible.
```text
[Client] ↔ WebSocket connection established
[Client] → {"type": "message", "content": "Hello"}
[Server] → {"type": "token", "content": "Hi"}
[Server] → {"type": "token", "content": " there"}
[Client] → {"type": "cancel"} (user can interrupt)
[Server] → {"type": "done"}
```

Pros:
- Bidirectional - can cancel, send follow-ups
- Lower overhead per message
- Better for high-frequency updates
Cons:
- Requires WebSocket infrastructure
- Doesn't work through some proxies
- More complex state management
Option 3: HTTP streaming (chunked transfer)
Uses standard HTTP with chunked transfer encoding. Good compatibility.
```text
[Client] → POST /api/chat
[Server] ← Transfer-Encoding: chunked
[Server] → chunk: Hello
[Server] → chunk: world
[Server] → 0 (end of chunks)
```

Pros:
- Works everywhere HTTP works
- No special protocol needed
Cons:
- Harder to handle errors mid-stream
- Limited browser APIs for reading chunks
Recommendation: Use SSE for most applications. It's simple, well-supported, and handles 95% of use cases. Switch to WebSockets only if you need bidirectional communication (e.g., user can cancel mid-stream).
Implementation guide
Let's build a complete streaming implementation with SSE.
Step 1: Server-side SSE endpoint (Next.js)
```typescript
// app/api/chat/route.ts
import { OpenAI } from 'openai';

export const runtime = 'nodejs';

export async function POST(request: Request) {
  const { messages } = await request.json();
  const openai = new OpenAI();

  // Create streaming response
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages,
    stream: true
  });

  // Convert to SSE format
  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          const content = chunk.choices[0]?.delta?.content;
          if (content) {
            // SSE format: "data: {json}\n\n"
            const data = JSON.stringify({ type: 'token', content });
            controller.enqueue(encoder.encode(`data: ${data}\n\n`));
          }
          // Check for finish reason
          if (chunk.choices[0]?.finish_reason) {
            const done = JSON.stringify({
              type: 'done',
              reason: chunk.choices[0].finish_reason
            });
            controller.enqueue(encoder.encode(`data: ${done}\n\n`));
          }
        }
      } catch (error) {
        // Caught values are typed `unknown` in TypeScript, so narrow first
        const errorData = JSON.stringify({
          type: 'error',
          message: error instanceof Error ? error.message : String(error)
        });
        controller.enqueue(encoder.encode(`data: ${errorData}\n\n`));
      } finally {
        controller.close();
      }
    }
  });

  return new Response(readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive'
    }
  });
}
```

Step 2: Client-side SSE consumption
```typescript
// hooks/useStreamingChat.ts
import { useState, useCallback } from 'react';

interface StreamState {
  content: string;
  isStreaming: boolean;
  error: string | null;
}

export function useStreamingChat() {
  const [state, setState] = useState<StreamState>({
    content: '',
    isStreaming: false,
    error: null
  });

  const sendMessage = useCallback(async (message: string) => {
    setState({ content: '', isStreaming: true, error: null });
    try {
      const response = await fetch('/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          messages: [{ role: 'user', content: message }]
        })
      });
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }

      const reader = response.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });

        // Parse SSE events from buffer
        const events = buffer.split('\n\n');
        buffer = events.pop() || ''; // Keep incomplete event in buffer

        for (const event of events) {
          if (!event.startsWith('data: ')) continue;
          const json = event.slice(6); // Remove "data: " prefix
          const data = JSON.parse(json);

          if (data.type === 'token') {
            setState(prev => ({
              ...prev,
              content: prev.content + data.content
            }));
          } else if (data.type === 'done') {
            setState(prev => ({ ...prev, isStreaming: false }));
          } else if (data.type === 'error') {
            setState(prev => ({
              ...prev,
              isStreaming: false,
              error: data.message
            }));
          }
        }
      }
    } catch (error) {
      setState(prev => ({
        ...prev,
        isStreaming: false,
        error: error instanceof Error ? error.message : String(error)
      }));
    }
  }, []);

  return { ...state, sendMessage };
}
```

Step 3: Rendering with batched updates
Rendering every token individually causes janky animations. Batch updates for smooth rendering.
```typescript
// hooks/useBufferedContent.ts
import { useState, useEffect, useRef } from 'react';

export function useBufferedContent(
  streamContent: string,
  bufferInterval: number = 50
) {
  const [displayContent, setDisplayContent] = useState('');
  const pendingRef = useRef(streamContent);

  useEffect(() => {
    pendingRef.current = streamContent;
  }, [streamContent]);

  useEffect(() => {
    // Flush pending content on a fixed cadence. The functional update means
    // the interval never needs to be torn down when content changes.
    const interval = setInterval(() => {
      setDisplayContent(prev =>
        prev === pendingRef.current ? prev : pendingRef.current
      );
    }, bufferInterval);
    return () => clearInterval(interval);
  }, [bufferInterval]);

  return displayContent;
}

// Usage in component (assumes ReactMarkdown and the hook above are imported)
function ChatMessage({ streamContent, isStreaming }) {
  const displayContent = useBufferedContent(streamContent, 50);
  return (
    <div className="message">
      <ReactMarkdown>{displayContent}</ReactMarkdown>
      {isStreaming && <span className="cursor-blink">▋</span>}
    </div>
  );
}
```

Step 4: Handling tool calls in streams
When agents use tools, handle them gracefully in the stream.
```typescript
interface StreamEvent {
  type: 'token' | 'tool_call' | 'tool_result' | 'done' | 'error';
  content?: string;
  toolCall?: {
    id: string;
    name: string;
    arguments: string;
  };
  toolResult?: {
    id: string;
    result: any;
  };
}

// Must be an async generator (function*) because it yields events as they arrive.
async function* processStreamWithTools(
  stream: AsyncIterable<any>
): AsyncGenerator<StreamEvent> {
  // Tool call fragments arrive keyed by numeric index, so key the map by number
  const pendingToolCalls: Map<number, any> = new Map();
  let accumulatedContent = '';

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta;

    // Handle content tokens
    if (delta?.content) {
      yield { type: 'token', content: delta.content };
      accumulatedContent += delta.content;
    }

    // Handle tool calls (arguments stream in as incremental string fragments)
    if (delta?.tool_calls) {
      for (const toolCall of delta.tool_calls) {
        const existing = pendingToolCalls.get(toolCall.index) || {
          id: '',
          name: '',
          arguments: ''
        };
        if (toolCall.id) existing.id = toolCall.id;
        if (toolCall.function?.name) existing.name = toolCall.function.name;
        if (toolCall.function?.arguments) {
          existing.arguments += toolCall.function.arguments;
        }
        pendingToolCalls.set(toolCall.index, existing);
      }
    }

    // Check if tool calls are complete
    const finishReason = chunk.choices[0]?.finish_reason;
    if (finishReason === 'tool_calls') {
      for (const [, toolCall] of pendingToolCalls) {
        yield {
          type: 'tool_call',
          toolCall: {
            id: toolCall.id,
            name: toolCall.name,
            arguments: toolCall.arguments
          }
        };
        // Execute tool and yield result (executeTool is your own dispatcher)
        const result = await executeTool(toolCall.name, JSON.parse(toolCall.arguments));
        yield {
          type: 'tool_result',
          toolResult: { id: toolCall.id, result }
        };
      }
      pendingToolCalls.clear();
    }

    if (finishReason === 'stop') {
      yield { type: 'done' };
    }
  }
}
```

UX patterns
Technical implementation is half the battle. Good UX patterns make streaming feel polished.
Pattern 1: Typing indicators before stream starts
Show a typing indicator during the 200-400ms before the first token.
```typescript
function ChatInterface() {
  const [showTyping, setShowTyping] = useState(false);
  const { content, isStreaming, sendMessage } = useStreamingChat();

  const handleSend = async (message: string) => {
    setShowTyping(true);
    await sendMessage(message);
  };

  // Hide typing indicator once content starts arriving
  useEffect(() => {
    if (content.length > 0) {
      setShowTyping(false);
    }
  }, [content]);

  return (
    <div>
      {showTyping && <TypingIndicator />}
      {content && <StreamingMessage content={content} />}
    </div>
  );
}
```

Pattern 2: Smooth cursor animation
A blinking cursor at the end of streaming content feels natural.
```css
.cursor-blink {
  animation: blink 1s step-end infinite;
}
@keyframes blink {
  50% { opacity: 0; }
}

/* Fade out cursor when streaming stops */
.cursor-fade {
  animation: fadeOut 0.3s ease-out forwards;
}
@keyframes fadeOut {
  to { opacity: 0; }
}
```

Pattern 3: Progress for long operations
For responses that include processing steps, show progress.
```typescript
function StreamingResponse({ events }) {
  return (
    <div>
      {events.map((event, i) => {
        if (event.type === 'token') {
          return <span key={i}>{event.content}</span>;
        }
        if (event.type === 'tool_call') {
          return (
            <div key={i} className="tool-indicator">
              <Spinner size="sm" />
              <span>Searching: {event.toolCall.name}...</span>
            </div>
          );
        }
        if (event.type === 'tool_result') {
          return (
            <div key={i} className="tool-complete">
              <CheckIcon />
              <span>Found {event.toolResult.result.count} results</span>
            </div>
          );
        }
        return null;
      })}
    </div>
  );
}
```

Pattern 4: Graceful error recovery
When streams fail mid-response, preserve partial content.
```typescript
function useResilientStream() {
  // Builds on useStreamingChat from Step 2; the accumulated `content`
  // doubles as the partial response if the stream dies mid-way.
  const { content: partialContent, sendMessage } = useStreamingChat();
  const [error, setError] = useState<string | null>(null);

  const sendWithRecovery = async (message: string) => {
    setError(null);
    try {
      await sendMessage(message);
    } catch {
      // Preserve what we received
      if (partialContent.length > 50) {
        setError(`Response interrupted. Showing partial response (${partialContent.length} characters received).`);
        // Don't clear partialContent - show what we have
      } else {
        setError('Failed to get response. Please try again.');
      }
    }
  };

  return { partialContent, error, sendWithRecovery };
}
```

Performance optimisation
Measure TTFT (Time to First Token)
```typescript
async function measureTTFT(sendMessage: () => Promise<void>) {
  const start = performance.now();
  let ttft: number | null = null;

  // Observe the message container; the first DOM mutation marks the first token
  const observer = new MutationObserver(() => {
    if (ttft === null) {
      ttft = performance.now() - start;
      observer.disconnect();
      console.log(`TTFT: ${ttft.toFixed(0)}ms`);
    }
  });

  observer.observe(document.querySelector('.message-container')!, {
    childList: true,
    subtree: true,
    characterData: true
  });

  await sendMessage();
}
```

Optimise for mobile
Mobile networks have higher latency. Adjust buffering and show loading states earlier.
```typescript
const isMobile = /iPhone|iPad|iPod|Android/i.test(navigator.userAgent);
const bufferInterval = isMobile ? 100 : 50; // More buffering on mobile
const loadingTimeout = isMobile ? 500 : 300; // Show loading earlier on mobile
```

FAQs
Should I stream short responses?
For responses under 100 tokens, streaming adds complexity without much UX benefit. Consider a threshold - stream responses expected to be long, don't stream short ones.
How do I handle markdown rendering during streaming?
Render incrementally, but be careful with incomplete markdown. Either wait for complete blocks or use a parser that handles partial markdown gracefully.
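One pragmatic middle ground is to patch the partial buffer before each render. A sketch (the helper name is hypothetical) that closes an unterminated code fence so the markdown renderer never sees a half-open block:

```typescript
// Hypothetical helper: if the streamed buffer contains an odd number of
// ``` fences, one block is still open - append a closing fence before rendering.
function closeDanglingFence(partial: string): string {
  const fenceCount = (partial.match(/```/g) ?? []).length;
  return fenceCount % 2 === 1 ? partial + '\n```' : partial;
}
```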
Can I stream structured outputs?
Partial JSON isn't valid JSON. Either stream as raw text and parse at the end, or use function calling where the complete result arrives atomically.
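If you take the stream-then-parse route, the code is tiny. A sketch (assuming the model was prompted to emit only JSON; `collectJson` is an illustrative name):

```typescript
// Sketch: accumulate the raw token stream, then parse once it is complete.
async function collectJson(tokens: AsyncIterable<string>): Promise<unknown> {
  let raw = '';
  for await (const t of tokens) raw += t; // partial JSON is never parsed
  return JSON.parse(raw); // throws if the model emitted invalid JSON
}
```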
What about rate limiting?
Streaming connections count toward connection limits, not rate limits. However, open streams consume server resources. Set reasonable timeouts and close idle connections.
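One way to enforce such a timeout server-side is to wrap the token stream in an idle watchdog that aborts if no chunk arrives for a while. A sketch (the `withIdleTimeout` helper is hypothetical, not a library API):

```typescript
// Sketch: re-yield tokens from a source stream, but throw if the source goes
// idle for longer than idleMs so the connection can be torn down.
async function* withIdleTimeout<T>(
  source: AsyncIterable<T>,
  idleMs: number
): AsyncGenerator<T> {
  const it = source[Symbol.asyncIterator]();
  while (true) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    try {
      const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error('stream idle timeout')), idleMs);
      });
      // Whichever settles first wins: the next chunk or the watchdog
      const result = await Promise.race([it.next(), timeout]);
      if (result.done) return;
      yield result.value;
    } finally {
      clearTimeout(timer); // avoid a stray rejection after a fast chunk
    }
  }
}
```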
How do I test streaming implementations?
Create mock streams that emit tokens at controlled intervals. Test both happy paths and interruptions.
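A mock stream can be as simple as an async generator. A sketch (names are illustrative) that emits tokens on a timer and can optionally fail partway through to exercise interruption handling:

```typescript
// Sketch: emits each token after intervalMs; if failAfter is set, throws
// before yielding that index to simulate a mid-stream disconnect.
async function* mockTokenStream(
  tokens: string[],
  intervalMs = 10,
  failAfter?: number
): AsyncGenerator<string> {
  for (let i = 0; i < tokens.length; i++) {
    if (failAfter !== undefined && i === failAfter) {
      throw new Error('simulated stream interruption');
    }
    await new Promise((r) => setTimeout(r, intervalMs));
    yield tokens[i];
  }
}
```

Feed this into the same consumption path as the real stream so tests cover identical code.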
Summary and next steps
Streaming transforms how users perceive AI agent speed. The implementation is straightforward with SSE, but details like buffering, error handling, and progress indicators separate good implementations from great ones.
Implementation checklist:
- Set up SSE endpoint with proper headers
- Implement client-side event parsing
- Add buffered rendering (50-100ms batches)
- Build typing indicator for pre-stream phase
- Handle stream interruptions gracefully
- Add tool call progress indicators
Quick wins:
- Replace synchronous calls with streaming for long responses
- Add typing indicator during TTFT window
- Implement basic buffering for smooth rendering
Internal links:
- /blog/building-conversational-ai-agents-context-management
- /blog/ai-agent-retry-strategies-exponential-backoff
- /blog/multi-agent-orchestration-implementation-guide