Concept: Parallel Processing & Performance Optimization
Overview
This example demonstrates concurrent execution of multiple LLM requests using separate context sequences, a critical technique for building scalable AI agent systems.
The Performance Problem
Sequential Processing (Slow)
The traditional approach processes one request at a time:
Request 1 ────────→ Response 1 (2s)
↓
Request 2 ────────→ Response 2 (2s)
↓
Total: 4 seconds
Parallel Processing (Fast)
This example processes multiple requests simultaneously:
Request 1 ────────→ Response 1 (2s) ──┐
                                      ├→ Total: 2 seconds
Request 2 ────────→ Response 2 (2s) ──┘
          (Both running at the same time)
Performance gain: 2x speedup!
Core Concept: Context Sequences
Single vs. Multiple Sequences
┌────────────────────────────────────────────────┐
│              Model (Loaded Once)               │
├────────────────────────────────────────────────┤
│                    Context                     │
│  ┌──────────────┐        ┌──────────────┐      │
│  │  Sequence 1  │        │  Sequence 2  │      │
│  │              │        │              │      │
│  │ Conversation │        │ Conversation │      │
│  │  History A   │        │  History B   │      │
│  └──────────────┘        └──────────────┘      │
└────────────────────────────────────────────────┘
Key insights:
- Model weights are shared (memory efficient)
- Each sequence has independent history
- Sequences can process in parallel
- Both use the same underlying model
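As a concrete sketch of this layout (assuming the node-llama-cpp API, where a context is created with a sequences option and each sequence backs its own chat session), the setup looks roughly like this; the model path is a placeholder:

import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "model.gguf"}); // placeholder path

// One context, two sequences: weights are shared, histories are not
const context = await model.createContext({sequences: 2});
const sessionA = new LlamaChatSession({contextSequence: context.getSequence()});
const sessionB = new LlamaChatSession({contextSequence: context.getSequence()});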
How Parallel Processing Works
Promise.all Pattern
JavaScript’s Promise.all() enables concurrent execution:
Sequential:
────────────────────────────────────
await fn1(); // Wait 2s
await fn2(); // Wait 2s more
Total: 4s
Parallel:
────────────────────────────────────
await Promise.all([
    fn1(), // Start immediately
    fn2()  // Start immediately (don't wait!)
]);
Total: 2s (whichever finishes last)
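A self-contained way to see the difference, using setTimeout stand-ins for real LLM calls:

const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fakeRequest(name: string): Promise<string> {
    await delay(2000); // stand-in for a ~2s LLM response
    return `${name} done`;
}

console.time("sequential"); // ~4s: the second call starts after the first finishes
await fakeRequest("Request 1");
await fakeRequest("Request 2");
console.timeEnd("sequential");

console.time("parallel"); // ~2s: both calls start before either finishes
await Promise.all([fakeRequest("Request 1"), fakeRequest("Request 2")]);
console.timeEnd("parallel");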
Execution Timeline
Time →  0s        1s        2s        3s        4s
        │         │         │         │         │
Seq 1:  ├───────Processing──┤
        │                   └─ Response 1
        │
Seq 2:  ├───────Processing──┤
                            └─ Response 2
Both complete at ~2s instead of 4s!
GPU Batch Processing
Why Batching Matters
Modern GPUs process multiple operations efficiently:
Without Batching (Inefficient)
──────────────────────────────
GPU: [Token 1]  ... wait ...
GPU: [Token 2]  ... wait ...
GPU: [Token 3]  ... wait ...
     └─ GPU underutilized

With Batching (Efficient)
─────────────────────────
GPU: [Tokens 1-1024]  ← Full batch
     └─ GPU fully utilized!
batchSize parameter: Controls how many tokens are processed together in one GPU pass.
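In code, assuming the same node-llama-cpp context options as the earlier sketch:

const context = await model.createContext({
    sequences: 4,
    batchSize: 512 // tokens evaluated per GPU pass; tune to your VRAM
});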
Trade-offs
Small Batch (e.g., 128)      Large Batch (e.g., 2048)
───────────────────────      ────────────────────────
✓ Lower memory               ✓ Better GPU utilization
✓ More flexible              ✓ Faster throughput
✗ Slower throughput          ✗ Higher memory usage
✗ GPU underutilized          ✗ May exceed VRAM
Sweet spot: Usually 512-1024 for consumer GPUs.
Architecture Patterns
Pattern 1: Multi-User Service
┌─────────┐  ┌─────────┐  ┌─────────┐
│ User A  │  │ User B  │  │ User C  │
└────┬────┘  └────┬────┘  └────┬────┘
     │            │            │
     └────────────┼────────────┘
                  ↓
         ┌────────────────┐
         │ Load Balancer  │
         └────────────────┘
                  ↓
     ┌────────────┼────────────┐
     ↓            ↓            ↓
┌─────────┐  ┌─────────┐  ┌─────────┐
│  Seq 1  │  │  Seq 2  │  │  Seq 3  │
└─────────┘  └─────────┘  └─────────┘
     └────────────┼────────────┘
                  ↓
         ┌────────────────┐
         │  Shared Model  │
         └────────────────┘
Pattern 2: Multi-Agent System
         ┌──────────────┐
         │     Task     │
         └──────┬───────┘
                │
      ┌─────────┼──────────┐
      ↓         ↓          ↓
  ┌────────┐ ┌──────┐ ┌──────────┐
  │Planner │ │Critic│ │ Executor │
  │ Agent  │ │Agent │ │  Agent   │
  └───┬────┘ └──┬───┘ └────┬─────┘
      │         │          │
      └─────────┼──────────┘
                ↓
      (All run in parallel)
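A sketch of Pattern 2, reusing the model/context setup from the earlier sketch (the context would need sequences: 3 here; the task string and prompts are illustrative):

// One session per agent, each on its own sequence of the shared model
const [planner, critic, executor] = [0, 1, 2].map(
    () => new LlamaChatSession({contextSequence: context.getSequence()})
);

const task = "Reduce checkout latency by 30%"; // example task
const [plan, critique, draft] = await Promise.all([
    planner.prompt(`Plan the steps for: ${task}`),
    critic.prompt(`List the main risks in: ${task}`),
    executor.prompt(`Draft a first solution for: ${task}`)
]);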
Pattern 3: Pipeline Processing
Input Queue: [Task1, Task2, Task3, ...]
                  ↓
          ┌───────────────┐
          │  Dispatcher   │
          └───────────────┘
                  ↓
      ┌───────────┼───────────┐
      ↓           ↓           ↓
 Sequence 1  Sequence 2  Sequence 3
      ↓           ↓           ↓
      └───────────┼───────────┘
                  ↓
         Output: [R1, R2, R3]
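The dispatcher itself can be a plain worker pool: N workers pull tasks from a shared queue, one worker per sequence. A minimal sketch (runPool and process are illustrative names, not a library API):

async function runPool<T, R>(
    tasks: T[],
    workers: number,
    process: (task: T, worker: number) => Promise<R>
): Promise<R[]> {
    const results: R[] = new Array(tasks.length);
    let next = 0;

    // Each worker repeatedly claims the next task index until none remain.
    // Claiming (next++) is safe here: it happens synchronously between
    // awaits on JavaScript's single-threaded event loop.
    await Promise.all(
        Array.from({length: workers}, async (_, worker) => {
            while (next < tasks.length) {
                const i = next++;
                results[i] = await process(tasks[i], worker);
            }
        })
    );
    return results;
}

Each worker would typically hold one context sequence (or one chat session) for the lifetime of the pool.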
Resource Management
Memory Allocation
Each sequence consumes memory:
┌──────────────────────────────────┐
│          Total VRAM: 8GB         │
├──────────────────────────────────┤
│ Model Weights:         4.0 GB    │
│ Context Base:          1.0 GB    │
│ Sequence 1 (KV Cache): 0.8 GB    │
│ Sequence 2 (KV Cache): 0.8 GB    │
│ Sequence 3 (KV Cache): 0.8 GB    │
│ Overhead:              0.6 GB    │
├──────────────────────────────────┤
│ Total Used:            8.0 GB    │
│ Remaining:             0.0 GB    │
└──────────────────────────────────┘
         Maximum capacity!
Formula:
Required VRAM = Model + Context + (NumSequences × KVCache)
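Plugging the 8 GB example into the formula (illustrative numbers only):

const totalVramGB = 8.0;
const modelGB = 4.0, contextBaseGB = 1.0, overheadGB = 0.6;
const kvCachePerSequenceGB = 0.8;

// How many sequences fit in the remaining VRAM?
const maxSequences = Math.floor(
    (totalVramGB - modelGB - contextBaseGB - overheadGB) / kvCachePerSequenceGB
); // (8.0 - 4.0 - 1.0 - 0.6) / 0.8 = 3 sequences fit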
Finding Optimal Sequence Count
Too Few (1-2)        Optimal (4-8)        Too Many (16+)
─────────────        ─────────────        ──────────────
GPU underutilized    Balanced use         Memory overflow
        ↓                  ↓                     ↓
Slow throughput      Best performance     Thrashing/crashes
Test your system:
- Start with 2 sequences
- Monitor VRAM usage
- Increase until performance plateaus
- Back off if memory issues occur
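That tuning loop, sketched against the same assumed node-llama-cpp API (model comes from the earlier sketch; the workload size and the 5% plateau threshold are arbitrary choices):

// Hypothetical probe: run a fixed batch of prompts at a given sequence
// count, round-robin across sequences, and return requests per minute.
async function throughputAt(numSequences: number): Promise<number> {
    const context = await model.createContext({sequences: numSequences});
    const sequences = Array.from({length: numSequences}, () => context.getSequence());
    const prompts = Array.from({length: numSequences * 4}, (_, i) => `Summarize topic #${i}`);

    const start = Date.now();
    await Promise.all(sequences.map(async (sequence, w) => {
        const session = new LlamaChatSession({contextSequence: sequence});
        // Worker w handles prompts w, w + N, w + 2N, ...
        for (let i = w; i < prompts.length; i += numSequences)
            await session.prompt(prompts[i]);
    }));
    await context.dispose();
    return prompts.length / ((Date.now() - start) / 60_000); // requests/min
}

let best = {sequences: 1, throughput: 0};
for (const n of [2, 4, 8, 16]) {
    const throughput = await throughputAt(n);
    if (throughput < best.throughput * 1.05) break; // plateaued: stop scaling up
    best = {sequences: n, throughput};
}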
Real-World Scenarios
Scenario 1: Chatbot Service
Challenge: 100 users, each waiting 2s per response
Sequential: 100 × 2s = 200s (3.3 minutes!)
Parallel (10 seq): 10 batches × 2s = 20s
10x speedup!
Scenario 2: Batch Analysis
Task: Analyze 1000 documents
Sequential: 1000 × 3s = 50 minutes
Parallel (8 seq): 125 batches × 3s = 6.25 minutes
8x speedup!
Scenario 3: Multi-Agent Collaboration
Agents: Planner, Analyzer, Executor (all needed)
Sequential: Wait for each → Slow pipeline
Parallel: All work together → Fast decision-making
Limitations & Considerations
1. Context Capacity Sharing
Problem: Sequences share total context space
───────────────────────────────────────────
Total context: 4096 tokens
2 sequences: Each gets ~2048 tokens max
4 sequences: Each gets ~1024 tokens max
More sequences = Less history per sequence!
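In code, assuming the sharing behavior described above (the contextSize value is illustrative; model as in the earlier sketch):

// 4096 total tokens across 4 sequences ≈ 1024 tokens of history each
const context = await model.createContext({
    contextSize: 4096,
    sequences: 4
});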
2. CPU vs GPU Parallelism
With GPU:                     CPU Only:
True parallel processing      Interleaved processing
Multiple CUDA streams         Single-thread context switching
                              (Still helps throughput!)
3. Not Always Faster
When parallel helps:          When it doesn't:
• Independent requests        • Dependent requests (must wait)
• I/O-bound operations        • Very short prompts (overhead)
• Multiple users              • Single sequential conversation
Best Practices
1. Design for Independence
✓ Good: Separate user conversations
✓ Good: Independent analysis tasks
✗ Bad: Sequential reasoning steps (use ReAct instead)
2. Monitor Resources
Track:
• VRAM usage per sequence
• Processing time per request
• Queue depths
• Error rates
3. Implement Graceful Degradation
if (vramExceeded) {
    reduceSequenceCount();
    // or queue requests instead
}
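The "queue requests instead" branch can be a small concurrency gate. A minimal, library-free sketch (RequestGate is an illustrative name):

class RequestGate {
    private queue: (() => void)[] = [];
    private active = 0;

    constructor(private readonly limit: number) {}

    private async acquire(): Promise<void> {
        // Re-check after every wakeup so the limit is never exceeded
        while (this.active >= this.limit)
            await new Promise<void>((resolve) => this.queue.push(resolve));
        this.active++;
    }

    private release(): void {
        this.active--;
        this.queue.shift()?.(); // wake the next queued request, if any
    }

    async run<R>(fn: () => Promise<R>): Promise<R> {
        await this.acquire();
        try {
            return await fn();
        } finally {
            this.release();
        }
    }
}

// Usage: const gate = new RequestGate(4);
//        const reply = await gate.run(() => session.prompt(text));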
4. Handle Errors Properly
// Promise.all() rejects on the first failure and discards the other
// results. Promise.allSettled() keeps every outcome, so one failed
// sequence doesn't throw away the other sequences' responses:
const results = await Promise.allSettled([...]);
const responses = results
    .filter((result) => result.status === "fulfilled")
    .map((result) => result.value);

Comparison: Evolution of Performance
Stage                Requests/Min   Pattern
─────────────────    ────────────   ───────────────────
1. Basic (intro)     30             Sequential
2. Batch (this)      120            4 sequences
3. Load balanced     240            8 sequences + queue
4. Distributed       1000+          Multiple machines
Key Takeaways
- Parallelism is essential for production AI agent systems
- Sequences share model but maintain independent state
- Promise.all enables concurrent JavaScript execution
- Batch size affects GPU utilization and throughput
- Memory is the limit - more sequences need more VRAM
- Not magic - only helps with independent tasks
Practical Formula
Speedup = min(
    Number_of_Sequences,
    Available_VRAM / Memory_Per_Sequence,
    GPU_Compute_Limit
)
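The same estimate as code, reusing the numbers from the VRAM example (the compute limit is an assumed figure):

const numSequences = 8;
const availableVramGB = 2.4;   // VRAM left after weights + context base + overhead
const vramPerSequenceGB = 0.8; // KV cache per sequence
const gpuComputeLimit = 6;     // rough concurrency the GPU sustains (assumed)

const estimatedSpeedup = Math.min(
    numSequences,
    availableVramGB / vramPerSequenceGB, // 2.4 / 0.8 = 3 → memory is the bottleneck
    gpuComputeLimit
);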
Typically: 2-10x speedup for well-designed systems.
This technique is foundational for building scalable agent architectures that can handle real-world workloads efficiently.