Concept: Streaming & Response Control
Overview
This example demonstrates streaming responses and token limits, two essential techniques for building responsive AI agents with controlled output.
The Streaming Problem
Traditional (Non-Streaming) Approach
User sends prompt
↓
[Wait 10 seconds...]
↓
Complete response appears all at once
Problems:
- Poor user experience (long wait)
- No progress indication
- Can't interrupt bad responses
- Feels unresponsive
Streaming Approach (This Example)
User sends prompt
↓
"Hoisting" (0.1s) → User sees first word!
↓
"is a" (0.2s) → More text appears
↓
"JavaScript" (0.3s) → Continuous feedback
↓
[Continues token by token...]
Benefits:
- Immediate feedback
- Progress visible
- Can interrupt early
- Feels interactive
How Streaming Works
Token-by-Token Generation
LLMs generate one token at a time internally. Streaming exposes this:
Internal LLM Process:
┌──────────────────────────┐
│ Token 1: "Hoisting"      │
│ Token 2: "is"            │
│ Token 3: "a"             │
│ Token 4: "JavaScript"    │
│ Token 5: "mechanism"     │
│ ...                      │
└──────────────────────────┘

Without Streaming:              With Streaming:
  Wait for all tokens             Emit each token immediately
  └─→ Buffer → Return             └─→ Callback → Display
The onTextChunk Callback
┌────────────────────────────────────┐
│          Model Generation          │
└─────────────────┬──────────────────┘
                  │
         ┌────────┴─────────┐
         │  Each new token  │
         └────────┬─────────┘
                  ↓
         ┌────────────────────┐
         │ onTextChunk(text)  │  ← Your callback
         └────────┬───────────┘
                  ↓
Your code processes it:
• Display to user
• Send over network
• Log to file
• Analyze content
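A minimal sketch of wiring up the callback, using the same session API as the rest of this example (the option names maxTokens and onTextChunk come from the calls shown later in this document; the `session` object is assumed to already exist):

// Minimal streaming sketch: mirror each chunk to the terminal while also
// accumulating the full text. Assumes the `session` object from this example.
let fullText = '';

const response = await session.prompt('Explain hoisting in JavaScript', {
    maxTokens: 2000,
    onTextChunk(chunk) {
        fullText += chunk;            // keep the complete response for later use
        process.stdout.write(chunk);  // display each piece the moment it arrives
    }
});

console.log('\nDone. Streamed', fullText.length, 'characters.');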
Token Limits: maxTokens
Why Limit Output?
Without limits, a model might generate:
User: "Explain hoisting"
Model: [Generates thousands of words, including:
- Complete JavaScript history
- Every edge case
- Unrelated examples
- ...and keeps going until the context is exhausted]
With limits:
User: "Explain hoisting"
Model: [Generates ~1500 words
- Core concept
- Key examples
- Stops at 2000 tokens]
Token Budgeting
Context Window: 4096 tokens
├─ System Prompt: 200 tokens
├─ User Message: 100 tokens
├─ Response (maxTokens): 2000 tokens
└─ Remaining for history: 1796 tokens
Total used: 2300 tokens
Available: 1796 tokens for future conversation
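The same budget as a quick calculation (the numbers are the illustrative ones above, not fixed values):

// Token budgeting sketch using the illustrative numbers above.
const contextSize  = 4096;
const systemPrompt = 200;   // tokens
const userMessage  = 100;   // tokens
const maxTokens    = 2000;  // reserved for the response

const used      = systemPrompt + userMessage + maxTokens;  // 2300
const remaining = contextSize - used;                      // 1796 for future turns

console.log({ used, remaining });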
Cost vs Quality
Token Limit       Output Quality          Use Case
───────────       ──────────────          ─────────────────
100               Brief, may be cut       Quick answers
500               Concise but complete    Short explanations
2000 (example)    Detailed                Full explanations
No limit          Risk of rambling        When length unknown
Real-Time Applications
Pattern 1: Interactive CLI
User: "Explain closures"
↓
Terminal: "A closure is a function..."
(Appears word by word, like typing)
↓
User sees progress, knows it's working
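A sketch of this pattern, assuming the same `session` object as above; writing chunks without a trailing newline is what produces the "typing" effect:

import readline from 'node:readline/promises';

// Interactive CLI sketch: ask a question, stream the answer like live typing.
const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
const question = await rl.question('You: ');

process.stdout.write('Model: ');
await session.prompt(question, {
    maxTokens: 2000,
    onTextChunk(chunk) {
        process.stdout.write(chunk);  // no newline, so the text appears to type itself
    }
});
process.stdout.write('\n');
rl.close();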
Pattern 2: Web Application
Browser                        Server
   │                             │
   ├─── Send prompt ────────────→│
   │                             │
   │←── Chunk 1: "Closures" ─────┤
   │    (Display immediately)    │
   │                             │
   │←── Chunk 2: "are" ──────────┤
   │    (Append to display)      │
   │                             │
   │←── Chunk 3: "functions" ────┤
   │    (Keep appending...)      │
Implementation:
- Server-Sent Events (SSE)
- WebSockets
- HTTP streaming
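One possible SSE wiring, sketched with Node's built-in http module; the port, prompt, and message framing are illustrative choices, and `session` is assumed to be in scope:

import http from 'node:http';

// SSE sketch: forward every chunk to the browser as it is generated.
http.createServer(async (req, res) => {
    res.writeHead(200, {
        'Content-Type': 'text/event-stream',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive'
    });

    await session.prompt('Explain closures', {
        maxTokens: 2000,
        onTextChunk(chunk) {
            // Each SSE message is "data: <payload>\n\n"
            res.write(`data: ${JSON.stringify(chunk)}\n\n`);
        }
    });

    res.write('data: [DONE]\n\n');  // conventional end-of-stream marker
    res.end();
}).listen(3000);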
Pattern 3: Multi-Consumer
          onTextChunk(text)
                  │
        ┌─────────┼─────────┐
        ↓         ↓         ↓
     Console  WebSocket  Log File
     Display  → Client   → Storage
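The fan-out is just one callback feeding several consumers; `connectedClient` here is a stand-in for however your application tracks WebSocket connections:

import fs from 'node:fs';

// Multi-consumer sketch: console, WebSocket client, and log file share one callback.
const logStream = fs.createWriteStream('response.log', { flags: 'a' });

function makeFanOut(wsClient) {
    return (chunk) => {
        process.stdout.write(chunk);  // console display
        wsClient?.send(chunk);        // forward to a connected client, if any
        logStream.write(chunk);       // persist to disk
    };
}

await session.prompt('Explain closures', {
    maxTokens: 2000,
    onTextChunk: makeFanOut(connectedClient)  // connectedClient: your app's WS handle (assumed)
});
logStream.end();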
Performance Characteristics
Latency vs Throughput
Time to First Token (TTFT):
├─ Small model (1.7B): ~100ms
├─ Medium model (8B): ~200ms
└─ Large model (20B): ~500ms
Tokens Per Second:
├─ Small model: 50-80 tok/s
├─ Medium model: 20-35 tok/s
└─ Large model: 10-15 tok/s
User Experience:
TTFT < 500ms → Feels instant
Tok/s > 20 → Reads naturally
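TTFT is easy to measure from inside the callback; the timings above are rough expectations that vary with hardware, and `session` is the assumed object from this example:

// TTFT sketch: record when the first chunk arrives relative to the request.
const start = Date.now();
let firstChunkAt = null;

await session.prompt('Explain hoisting', {
    maxTokens: 2000,
    onTextChunk(chunk) {
        if (firstChunkAt === null) {
            firstChunkAt = Date.now();
            console.log(`Time to first token: ${firstChunkAt - start}ms`);
        }
        process.stdout.write(chunk);
    }
});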
Resource Trade-offs
Model Size    Memory    Speed     Quality
──────────    ──────    ─────     ───────
1.7B          ~2GB      Fast      Good
8B            ~6GB      Medium    Better
20B           ~12GB     Slower    Best
Advanced Concepts
Buffering Strategies
No Buffer (Immediate)
Every token → callback → display
└─ Smoothest UX but more overhead
Line Buffer
Accumulate until newline → flush
└─ Better for paragraph-based output
Time Buffer
Accumulate for 50ms → flush batch
└─ Reduces callback frequency
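A sketch of the time-buffer strategy: chunks pile up in memory and a timer flushes them, trading a little latency for far fewer display updates (appendToDisplay stands in for whatever rendering function you use, as in the Best Practices snippets below):

// Time-buffer sketch: collect chunks and flush them roughly every 50ms.
let buffer = '';
const flushEveryMs = 50;

const timer = setInterval(() => {
    if (buffer.length > 0) {
        appendToDisplay(buffer);  // assumed display helper
        buffer = '';
    }
}, flushEveryMs);

await session.prompt('Explain closures', {
    maxTokens: 2000,
    onTextChunk(chunk) {
        buffer += chunk;  // just accumulate; the timer does the flushing
    }
});

clearInterval(timer);
if (buffer.length > 0) appendToDisplay(buffer);  // flush whatever is left over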
Early Stopping
Generation in progress:
"The answer is clearly... wait, actually..."
↑
onTextChunk detects issue
↓
Stop generation
↓
"Let me reconsider"
Useful for:
- Detecting off-topic responses
- Safety filters
- Relevance checking
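A sketch of early stopping, assuming the session's prompt() accepts a standard AbortSignal via a signal option; if your API exposes a different cancellation hook, the same idea applies:

// Early-stopping sketch: abort generation if the stream drifts in an unwanted direction.
const controller = new AbortController();
let collected = '';

try {
    await session.prompt('Explain hoisting', {
        maxTokens: 2000,
        signal: controller.signal,  // assumed option; adapt to your API's cancellation hook
        onTextChunk(chunk) {
            collected += chunk;
            if (collected.includes('wait, actually')) {
                controller.abort();  // stop as soon as the response second-guesses itself
            }
        }
    });
} catch (err) {
    // Aborting usually surfaces as an error; the partial text is still in `collected`.
}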
Progressive Enhancement
Partial Response Analysis:
┌─────────────────────────────────┐
│ "To implement this feature..."  │
│                                 │
│  ← Already useful information   │
│                                 │
│ "...you'll need: 1) Node.js"    │
│                                 │
│  ← Can start acting on this     │
│                                 │
│ "2) Express framework"          │
└─────────────────────────────────┘
Agent can begin working before response completes!
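For example, an agent could start acting on each numbered step as soon as the marker for the next step begins to arrive; actOnStep is a hypothetical hook for whatever work the agent does:

// Progressive-enhancement sketch: act on completed numbered steps mid-stream.
let partial = '';
let stepsStarted = 0;

await session.prompt('How do I set up this project?', {
    maxTokens: 2000,
    onTextChunk(chunk) {
        partial += chunk;
        // Steps look like "1) ...", "2) ..."; a step counts as complete
        // once the marker for the following step has appeared.
        const steps = partial.split(/\d\)\s/).slice(1);
        while (stepsStarted < steps.length - 1) {
            actOnStep(steps[stepsStarted].trim());  // hypothetical: begin work on this step
            stepsStarted++;
        }
    }
});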
Context Size Awareness
Why It Matters
┌──────────────────────────────────┐
│      Context Window (4096)       │
├──────────────────────────────────┤
│ System Prompt          200 tokens│
│ Conversation History  1000 tokens│
│ Current Prompt         100 tokens│
│ Response Space        2796 tokens│
└──────────────────────────────────┘
If maxTokens > 2796:
└─→ Error or truncation!
Dynamic Adjustment
const available = contextSize - (promptTokens + historyTokens);

if (maxTokens > available) {
    maxTokens = available;
    // or clear old history to free up space
}
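Expanded into a small helper; countTokens stands in for your tokenizer's counting function and is not part of this example's API:

// Sketch: clamp the requested maxTokens to what actually fits in the context window.
function fitMaxTokens({ contextSize, systemPrompt, history, prompt, desiredMaxTokens }) {
    const used = countTokens(systemPrompt) + countTokens(history) + countTokens(prompt);
    const available = contextSize - used;

    if (available <= 0) {
        throw new Error('No room left: trim or summarize the conversation history first.');
    }
    return Math.min(desiredMaxTokens, available);
}

const maxTokens = fitMaxTokens({
    contextSize: 4096,
    systemPrompt, history, prompt: query,  // assumed to be in scope
    desiredMaxTokens: 2000
});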
Streaming in Agent Architectures
Simple Agent
User → LLM (streaming) → Display
└─ onTextChunk shows progress
Multi-Step Agent
Step 1: Plan (stream) → Show thinking
Step 2: Act (stream) → Show action
Step 3: Result (stream) → Show outcome
└─ User sees agent's process
Collaborative Agents
Agent A (streaming) ──┐
                      ├─→ Coordinator → User
Agent B (streaming) ──┘
└─ Both stream simultaneously
Best Practices
1. Always Set maxTokens
✓ Good:
  session.prompt(query, { maxTokens: 2000 })

✗ Risky:
  session.prompt(query)
  └─ May use entire context!
2. Handle Partial Updates
let fullResponse = '';
let logComplete = false;

onTextChunk: (chunk) => {
    fullResponse += chunk;
    display(chunk);       // Show immediately
    logComplete = false;  // Still streaming; the stored copy is incomplete
}

// After completion:
logComplete = true;
saveToDatabase(fullResponse);
3. Provide Feedback
let firstChunk = true;

onTextChunk: (chunk) => {
    if (firstChunk) {
        showLoadingDone();  // hide the loading indicator as soon as text arrives
        firstChunk = false;
    }
    appendToDisplay(chunk);
}
4. Monitor Performance
const startTime = Date.now();
let tokenCount = 0;

onTextChunk: (chunk) => {
    tokenCount += estimateTokens(chunk);  // rough estimate, e.g. chunk.length / 4
    const elapsed = (Date.now() - startTime) / 1000;
    const tokensPerSecond = tokenCount / elapsed;
    updateMetrics(tokensPerSecond);
}
Key Takeaways
- Streaming improves UX: Users see progress immediately
- maxTokens controls cost: Prevents runaway generation
- Token-by-token generation: LLMs produce one token at a time
- onTextChunk callback: Your hook into the generation process
- Context awareness matters: Monitor available space
- Essential for production: Real-time systems need streaming
Comparison
Feature             intro.js    coding.js (this)
────────────────    ────────    ────────────────
Streaming           ✗           ✓
Token limit         ✗           ✓ (2000)
Real-time output    ✗           ✓
Progress visible    ✗           ✓
User control        ✗           ✓
This pattern is foundational for building responsive, user-friendly AI agent interfaces.