1. Symptoms
The clw-router-oom error manifests when the OpenClaw router process exhausts its allocated memory budget. This error is critical and typically causes the router to terminate or enter a degraded state.
Observable Symptoms
When this error occurs, you will observe the following symptoms in your system:
Process Termination or Crash:
[OpenClaw Router] FATAL: clw-router-oom: Router heap exhausted
[OpenClaw Router] Request count: 15,847
[OpenClaw Router] Memory used: 2048 MB / 2048 MB
[OpenClaw Router] Duration: 00:23:45
[OpenClaw Router] Exiting with code 137 (SIGKILL)
Degraded Performance Leading to OOM:
[OpenClaw Router] WARN: Memory pressure detected
[OpenClaw Router] Allocated: 1536 MB
[OpenClaw Router] Available: 512 MB
[OpenClaw Router] GC threshold: 70%
[OpenClaw Router] WARN: Request queue backlog: 2,500 pending requests
[OpenClaw Router] WARN: Slow GC detected (pause: 450ms)
Client-Side Impact:
- Requests hang indefinitely with no response
- Connection timeouts increase
- WebSocket connections drop unexpectedly
- API clients receive 503 Service Unavailable responses
- Load balancer marks router as unhealthy
Memory Growth Pattern:
[OpenClaw Router] Memory usage trend:
[OpenClaw Router] 00:00:00 - 256 MB
[OpenClaw Router] 00:05:00 - 512 MB
[OpenClaw Router] 00:10:00 - 1024 MB
[OpenClaw Router] 00:15:00 - 1536 MB
[OpenClaw Router] 00:20:00 - 2048 MB (OOM)
[OpenClaw Router] Growth rate: ~85 MB/minute
The error may also appear in system logs as a process being OOM-killed:
kernel: [12345.678901] oom-killer: gpid=12345, comm=clw-router,
oom_score_adj=500, vm_rss=2097152 kB,
killed process as a result of limit set
2. Root Cause
The clw-router-oom error occurs when memory consumption exceeds the router's allocated heap size. Several underlying issues can cause this.
Primary Root Causes
1. Unbounded Request Queue
The router maintains an internal queue for incoming requests. When request processing slows down (due to backend latency, network issues, or downstream service failures), requests accumulate faster than they are processed. The queue grows without limit until memory is exhausted.
// Vulnerable queue implementation (causes OOM)
class RouterQueue {
  constructor() {
    this.pending = []; // No max size enforcement
  }
  enqueue(request) {
    this.pending.push(request); // Always succeeds, grows unbounded
  }
}
2. Memory Leaks in Request Handlers
Improper handling of request context, response objects, or event listeners can cause memory to accumulate. Common leak patterns include:
- Response streams that are never released
- Event listeners attached but never removed
- Circular references in closure-based handlers
- Accumulated buffer data that is never flushed
// Memory leak pattern example
router.on('request', (req, res) => {
  const buffer = [];
  req.on('data', (chunk) => buffer.push(chunk));
  req.on('end', () => processRequest(buffer));
  // BUG: req object and buffer are never cleaned up
  // even after response is sent
});
3. Large Response Caching Without Eviction
If the router caches responses or partial data, an unbounded cache will consume all available memory. This is especially problematic with large payloads or high request diversity.
# openclaw.yaml - problematic cache configuration
router:
  cache:
    enabled: true
    max_entries: 1000000 # No size limit, entries only
    ttl: 3600
    # Missing: max_memory_mb or max_entry_size
4. Upstream Response Streaming
When the router proxies responses from upstream services, buffering entire responses in memory before sending them to clients causes memory spikes, especially with large files or streams.
5. Insufficient Memory Configuration
The router may be configured with too little memory for its workload. This can happen when:
- Workload has grown since initial deployment
- Memory limit was set too conservatively
- Other processes on the host consume shared memory
Secondary Contributing Factors
6. Garbage Collection Pressure
High allocation rates cause frequent GC cycles, which can temporarily spike memory usage and compound memory pressure issues.
7. Fragmentation
Long-running router processes can experience heap fragmentation, reducing effective usable memory even when total allocated memory appears reasonable.
8. Concurrency Overload
An excessive number of concurrent requests can cause parallel processing to allocate memory faster than sequential processing would, overwhelming memory limits.
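One common way to contain concurrency overload is to cap the number of requests processed at the same time. The following is a minimal sketch, not the router's actual mechanism; `createLimiter` and its `run`/`stats` members are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: caps how many async tasks run at once.
function createLimiter(maxConcurrent) {
  let active = 0;
  const waiting = []; // resolvers for tasks parked until a slot frees up

  async function run(task) {
    if (active >= maxConcurrent) {
      // Park until a finishing task hands over its slot.
      await new Promise((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) {
        next(); // the slot is transferred, so `active` stays unchanged
      } else {
        active--;
      }
    }
  }

  return { run, stats: () => ({ active, waiting: waiting.length }) };
}
```

Note that the `waiting` list in this sketch is itself unbounded, so a limiter like this would normally be paired with a bounded queue that rejects work once too many tasks are parked.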
3. Step-by-Step Fix
Fix 1: Implement Bounded Request Queue
Add a maximum queue size with configurable backpressure behavior.
Before:
# openclaw.yaml - unbounded queue (vulnerable)
router:
  queue:
    type: fifo
    # No size limits configured
After:
# openclaw.yaml - bounded queue with backpressure
router:
  queue:
    type: fifo
    max_size: 10000
    max_memory_mb: 512
    overflow_action: reject
    overflow_status: 503
    overflow_message: "Service overloaded, try again later"
    drain_timeout_ms: 30000
Before:
// queue-handler.js - unbounded implementation
function handleRequest(req, res) {
  requestQueue.push({ req, res, timestamp: Date.now() });
  processQueue();
}
function processQueue() {
  while (requestQueue.length > 0) {
    const item = requestQueue.shift();
    // Process item...
  }
}
After:
// queue-handler.js - bounded implementation with backpressure
const MAX_QUEUE_SIZE = 10000;
const requestQueue = [];
function handleRequest(req, res) {
  if (requestQueue.length >= MAX_QUEUE_SIZE) {
    res.writeHead(503, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({
      error: 'clw-router-oom',
      message: 'Service overloaded, try again later'
    }));
    metrics.increment('router.queue.rejected');
    return;
  }
  requestQueue.push({ req, res, timestamp: Date.now() });
  processQueue();
}
function processQueue() {
  // Defer to the next event-loop turn so enqueueing never blocks;
  // each call processes one queued item
  setImmediate(() => {
    const item = requestQueue.shift();
    if (item) {
      processItem(item);
    }
  });
}
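The configuration in this fix bounds the queue by memory (`max_memory_mb`) as well as by entry count, while the handler above checks only the count. A minimal sketch of a queue that enforces both limits could look like the following; `BoundedQueue`, `tryEnqueue`, and the caller-supplied `sizeBytes` estimate are hypothetical and not part of the router's API.

```javascript
// Hypothetical sketch: a FIFO queue bounded by entry count and by estimated bytes.
class BoundedQueue {
  constructor({ maxSize, maxBytes }) {
    this.maxSize = maxSize;
    this.maxBytes = maxBytes;
    this.items = [];
    this.bytes = 0;
  }

  // Returns false instead of growing past either limit, so the caller can send a 503.
  tryEnqueue(item, sizeBytes) {
    if (this.items.length >= this.maxSize || this.bytes + sizeBytes > this.maxBytes) {
      return false;
    }
    this.items.push({ item, sizeBytes });
    this.bytes += sizeBytes;
    return true;
  }

  // Removes and returns the oldest item, or undefined when the queue is empty.
  dequeue() {
    const entry = this.items.shift();
    if (!entry) return undefined;
    this.bytes -= entry.sizeBytes;
    return entry.item;
  }

  get size() {
    return this.items.length;
  }
}

const queue = new BoundedQueue({ maxSize: 2, maxBytes: 100 });
console.log(queue.tryEnqueue('a', 60)); // true
console.log(queue.tryEnqueue('b', 60)); // false: would exceed the 100-byte budget
console.log(queue.tryEnqueue('c', 30)); // true
console.log(queue.tryEnqueue('d', 1));  // false: entry-count limit reached
```

Returning a boolean rather than throwing keeps the rejection path cheap, which matters because rejections happen precisely when the process is already under pressure.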
Fix 2: Prevent Memory Leaks in Request Handlers
Implement proper cleanup and resource management.
Before:
// request-handler.js - leaking handler
router.on('request', (req, res) => {
  const buffers = [];
  req.on('data', (chunk) => buffers.push(chunk));
  req.on('end', () => {
    const body = Buffer.concat(buffers);
    processRequest(req, res, body);
  });
  // BUG: buffers array grows indefinitely if stream doesn't end
  // BUG: No cleanup after response is complete
});
After:
// request-handler.js - leak-free handler
// Assumes processRequest(req, res, body) may return a promise
const MAX_BODY_BYTES = 50 * 1024 * 1024; // 50MB limit

router.on('request', (req, res) => {
  const buffers = [];
  let totalSize = 0;
  let streamEnded = false;

  // Timeout protection
  const timeout = setTimeout(() => {
    if (!streamEnded) {
      fail(408, 'Request timeout');
    }
  }, 30000);

  function cleanup() {
    // Stop the timer, remove listeners, and release buffered chunks
    clearTimeout(timeout);
    req.removeAllListeners('data');
    req.removeAllListeners('end');
    req.removeAllListeners('error');
    buffers.length = 0; // Release buffer memory
  }

  // Send an error response while that is still possible, then close the connection.
  // The response is written before req.destroy() so the socket is not closed first.
  function fail(status, message) {
    cleanup();
    if (res.headersSent) {
      req.destroy();
      return;
    }
    res.writeHead(status, { 'Content-Type': 'application/json', 'Connection': 'close' });
    res.end(JSON.stringify({ error: message }), () => req.destroy());
  }

  req.on('data', (chunk) => {
    buffers.push(chunk);
    // Guard against unbounded buffer growth using a running total
    totalSize += chunk.length;
    if (totalSize > MAX_BODY_BYTES) {
      fail(413, 'Request body too large');
    }
  });

  req.on('end', () => {
    streamEnded = true;
    const body = Buffer.concat(buffers);
    Promise.resolve(processRequest(req, res, body)).finally(cleanup);
  });

  req.on('error', () => fail(500, 'Request processing failed'));

  // 'close' also fires when the client disconnects before the response finishes
  res.on('close', () => clearTimeout(timeout));
});
Fix 3: Configure Memory-Bounded Cache
Add memory limits to prevent unbounded cache growth.
Before:
# openclaw.yaml - unbounded cache
router:
  cache:
    enabled: true
    max_entries: 1000000
    ttl: 3600
After:
# openclaw.yaml - memory-bounded cache
router:
  cache:
    enabled: true
    max_entries: 100000
    max_memory_mb: 256
    max_entry_size_kb: 1024
    ttl: 3600
    eviction_policy: lru
    preload:
      enabled: false
Implementation:
// bounded-cache.js - memory-aware cache
class BoundedCache {
  constructor(options = {}) {
    this.maxMemoryMB = options.max_memory_mb || 256;
    this.maxMemoryBytes = this.maxMemoryMB * 1024 * 1024;
    this.maxEntrySizeKB = options.max_entry_size_kb || 1024;
    this.maxEntrySizeBytes = this.maxEntrySizeKB * 1024;
    this.currentMemoryBytes = 0;
    this.cache = new Map();
    this.accessOrder = [];
  }

  get(key) {
    const entry = this.cache.get(key);
    if (entry) {
      this.updateAccessOrder(key);
      entry.hits++;
      return entry.value;
    }
    return null;
  }

  set(key, value) {
    const estimatedSize = this.estimateSize(value);
    // Reject entries that exceed max entry size or could never fit in the cache
    if (estimatedSize > this.maxEntrySizeBytes || estimatedSize > this.maxMemoryBytes) {
      throw new Error(`Cache entry exceeds maximum size: ${estimatedSize} bytes`);
    }
    // Remove the existing entry first so its size is not counted during eviction
    const existing = this.cache.get(key);
    if (existing) {
      this.currentMemoryBytes -= existing.size;
      this.cache.delete(key);
    }
    // Evict least recently used entries until the new entry fits
    while (this.cache.size > 0 &&
           this.currentMemoryBytes + estimatedSize > this.maxMemoryBytes) {
      this.evictLRU();
    }
    this.cache.set(key, {
      value,
      size: estimatedSize,
      createdAt: Date.now(),
      hits: 0
    });
    this.currentMemoryBytes += estimatedSize;
    this.updateAccessOrder(key);
  }

  // Approximates the payload size in bytes; this is not the exact heap footprint
  estimateSize(value) {
    if (Buffer.isBuffer(value)) {
      return value.length;
    }
    if (typeof value === 'string') {
      return Buffer.byteLength(value);
    }
    // JSON.stringify returns undefined for values such as undefined or functions
    return Buffer.byteLength(JSON.stringify(value) ?? '');
  }

  evictLRU() {
    const oldestKey = this.accessOrder.shift();
    if (oldestKey !== undefined) {
      const entry = this.cache.get(oldestKey);
      if (entry) {
        this.currentMemoryBytes -= entry.size;
        this.cache.delete(oldestKey);
      }
    }
  }

  updateAccessOrder(key) {
    const index = this.accessOrder.indexOf(key);
    if (index > -1) {
      this.accessOrder.splice(index, 1);
    }
    this.accessOrder.push(key);
  }
}
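The `accessOrder` array above makes every access O(n), because `indexOf` and `splice` scan the array. Since a JavaScript `Map` iterates its keys in insertion order, the recency bookkeeping can be kept in the map itself. The following is a sketch of that alternative, bounded by entry count only for brevity; `MapLru` is a hypothetical name, and the byte accounting from `BoundedCache` would still need to be layered on top.

```javascript
// Hypothetical sketch: LRU bookkeeping using Map insertion order instead of an array.
class MapLru {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-inserting moves the key to the most-recently-used end.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    this.map.delete(key); // no-op when the key is absent
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.map.keys().next().value;
      this.map.delete(oldest);
    }
  }
}

const lru = new MapLru(2);
lru.set('a', 1);
lru.set('b', 2);
lru.get('a');    // 'a' becomes most recently used
lru.set('c', 3); // evicts 'b'
console.log([...lru.map.keys()]); // [ 'a', 'c' ]
```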
Fix 4: Stream Responses Instead of Buffering
Implement streaming for large responses to avoid memory spikes.
Before:
// proxy-handler.js - buffering large responses
async function proxyRequest(req, res, upstreamUrl) {
  const response = await fetch(upstreamUrl, {
    method: req.method,
    body: req.body
  });
  // BUG: Entire response body is buffered in memory
  const body = await response.arrayBuffer();
  res.writeHead(response.status, response.headers);
  res.end(Buffer.from(body)); // Sends entire buffer at once
}
After:
// proxy-handler.js - streaming responses
async function proxyRequest(req, res, upstreamUrl) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 60000);
  try {
    const response = await fetch(upstreamUrl, {
      method: req.method,
      body: req.body,
      signal: controller.signal
    });
    clearTimeout(timeout);
    const headers = Object.fromEntries(response.headers);
    // fetch decodes compressed bodies, so the upstream framing headers no longer apply
    delete headers['content-encoding'];
    delete headers['content-length'];
    delete headers['transfer-encoding'];
    res.writeHead(response.status, headers);
    // Stream response body directly to client (body is null for empty responses)
    if (response.body) {
      const reader = response.body.getReader();
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        // Write chunks as they arrive
        if (!res.write(value)) {
          // Handle backpressure - wait for drain
          await new Promise(resolve => res.once('drain', resolve));
        }
      }
    }
    res.end();
  } catch (error) {
    clearTimeout(timeout);
    if (res.headersSent) {
      // Part of the body was already streamed; abort rather than append JSON to it
      res.destroy(error);
      return;
    }
    res.writeHead(502, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ error: 'Upstream request failed' }));
  }
}
Fix 5: Increase Router Memory Limit
Adjust the memory allocation when workload legitimately requires more resources.
Before:
# Starting router with insufficient memory
clw-router --memory-mb 512
After:
# Starting router with appropriate memory allocation
clw-router --memory-mb 2048 --memory-swap-mb 256
# Alternative: Via environment variable
export CLW_ROUTER_HEAP_SIZE=2048
export CLW_ROUTER_HEAP_NEWSIZE=512
clw-router
Configuration update:
# openclaw.yaml - memory configuration
router:
  process:
    max_heap_mb: 2048
    max_old_space_mb: 1536
    max_new_space_mb: 256
    gc_interval_ms: 1000
    force_gc_threshold_mb: 256
4. Verification
After applying fixes, verify that the memory issue is resolved by checking the following:
1. Memory Usage Stability
# Monitor router memory over time
watch -n 5 'curl -s http://localhost:3000/metrics | jq .router.memory'
# Expected: Memory should stabilize below threshold
# Example output showing stable memory:
{
  "allocated_mb": 512,
  "used_mb": 384,
  "available_mb": 1664,
  "gc_count": 12,
  "gc_pause_ms": 45
}
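To judge whether readings have actually stabilized rather than eyeballing them, one option is to fit a straight line through periodic `used_mb` samples and look at the slope. The sketch below uses an ordinary least-squares fit; `growthRateMbPerMin` and the sample shape are assumptions made for illustration, and the sample values are invented.

```javascript
// Hypothetical helper: least-squares slope of memory readings, in MB per minute.
function growthRateMbPerMin(samples) {
  // samples: [{ minute: number, usedMb: number }, ...]
  const n = samples.length;
  if (n < 2) return 0;
  const meanX = samples.reduce((sum, p) => sum + p.minute, 0) / n;
  const meanY = samples.reduce((sum, p) => sum + p.usedMb, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.minute - meanX) * (p.usedMb - meanY);
    den += (p.minute - meanX) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Invented readings: one steadily growing series and one flat series.
const leaking = [100, 150, 200, 250].map((usedMb, minute) => ({ minute, usedMb }));
const stable = [384, 384, 384, 384].map((usedMb, minute) => ({ minute, usedMb }));
console.log(growthRateMbPerMin(leaking).toFixed(1)); // 50.0
console.log(growthRateMbPerMin(stable).toFixed(1));  // 0.0
```

A slope that stays near zero over a sustained load test suggests the fixes are holding; a persistently positive slope points to a remaining leak or unbounded structure.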
2. Queue Backpressure Functionality
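# Test queue rejection under load
# Note: with -c 1000, ab keeps at most 1000 requests in flight at once, so
# lower max_size below the concurrency level (or raise -c) to see rejections
ab -n 10000 -c 1000 http://localhost:3000/api/endpoint
# Check rejection metrics
curl -s http://localhost:3000/metrics | jq .router.queue
# Example output (values are illustrative):
{
  "size": 10000,
  "max_size": 10000,
  "enqueued": 10000,
  "rejected": 8421,
  "processed": 1579,
  "overflow_active": true
}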
3. Memory Leak Detection
# Run heap snapshot comparison after sustained load
# First, trigger garbage collection
curl -X POST http://localhost:3000/admin/gc
# Take heap snapshot
curl -o heap1.heapsnapshot http://localhost:3000/debug/heap-snapshot
# Run load test for 5 minutes
# ...
# Trigger another GC and take snapshot
curl -X POST http://localhost:3000/admin/gc
curl -o heap2.heapsnapshot http://localhost:3000/debug/heap-snapshot
# Compare heap sizes - should not grow significantly
# Use Chrome DevTools to compare the snapshots
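For a quick check before reaching for full heap snapshots, heap growth across a repeated workload can be measured in-process. This is a rough sketch assuming a Node.js runtime; `measureHeapGrowth` is a hypothetical helper, and `globalThis.gc` is only defined when Node.js is started with `--expose-gc`, so without that flag the numbers include garbage that has not been collected yet.

```javascript
// Hypothetical helper: reports heapUsed growth (in MB) across repeated runs of `work`.
async function measureHeapGrowth(work, iterations) {
  // Force a collection when available so leftover garbage does not skew the reading.
  const collect = () => {
    if (typeof globalThis.gc === 'function') globalThis.gc();
  };
  collect();
  const before = process.memoryUsage().heapUsed;
  for (let i = 0; i < iterations; i++) {
    await work(i);
  }
  collect();
  const after = process.memoryUsage().heapUsed;
  return (after - before) / 1024 / 1024;
}
```

Running the same workload with increasing iteration counts is more telling than a single number: growth that scales with the iteration count suggests retained objects, while growth that levels off is usually warm-up allocation. Note that `heapUsed` does not cover `Buffer` contents, which are reported under `external` by `process.memoryUsage()`.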
4. Cache Memory Bounds
# Verify cache respects memory limits
curl -s http://localhost:3000/metrics | jq .router.cache
# Expected output:
{
  "entries": 50000,
  "memory_mb": 256,
  "max_memory_mb": 256,
  "evictions": 1500,
  "hit_rate": 0.78
}
5. End-to-End Load Test
# Run sustained load test to verify stability
wrk -t4 -c100 -d300s -s post.lua http://localhost:3000/api/endpoint
# Monitor throughout the test
# Memory should remain stable, not growing linearly
# Latency should remain consistent (not increasing)
6. Verify Process Survival
# Run router under memory constraints using cgroup
# (cgroup v1 syntax; on cgroup v2 hosts the equivalent setting is memory.max)
sudo cgcreate -g memory:/clw-router-test
sudo cgset -r memory.limit_in_bytes=1073741824 /clw-router-test # 1GB limit
sudo cgexec -g memory:/clw-router-test clw-router --memory-mb 768
# Run load test - process should handle pressure, not crash
# Check that queue backpressure activates before OOM kill
5. Common Pitfalls
Pitfall 1: Increasing Memory Without Addressing Root Cause
Problem: Simply allocating more memory masks the underlying leak or architectural issue. Memory will eventually be exhausted again, often more catastrophically.
Solution: Always diagnose and fix the root cause before increasing memory limits. Use memory profiling to identify leaks.
# DON'T just do this without investigation:
clw-router --memory-mb 8192
# DO profile first:
node --inspect-brk=9229 clw-router
# Then use Chrome DevTools to take heap snapshots and identify leaks
Pitfall 2: Setting Queue Size Too Small
Problem: An overly restrictive queue size causes premature request rejection, resulting in poor user experience even when the system could handle the load.
Solution: Calculate queue size based on expected peak load, request size, and acceptable memory usage.
# WRONG: Too restrictive
router:
  queue:
    max_size: 100
    # Will reject requests during normal peak traffic
# BETTER: Calculate based on workload
router:
  queue:
    max_size: 10000
    max_memory_mb: 512
    # Based on: 1000 req/s * 10s acceptable wait * ~50KB avg = 500MB
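The sizing arithmetic in the comment above can be written out as a small helper, which makes it easy to redo the calculation when traffic assumptions change. `estimateQueueLimits` and its parameter names are hypothetical; decimal units (1 MB = 1000 KB) are assumed so the result matches the 500MB figure.

```javascript
// Hypothetical helper reproducing the sizing arithmetic (decimal units: 1 MB = 1000 KB).
function estimateQueueLimits({ requestsPerSecond, maxWaitSeconds, avgRequestKb }) {
  // Requests that can arrive during the longest acceptable wait.
  const maxSize = requestsPerSecond * maxWaitSeconds;
  // Memory needed to hold that many average-sized requests.
  const memoryMb = (maxSize * avgRequestKb) / 1000;
  return { maxSize, memoryMb };
}

console.log(estimateQueueLimits({ requestsPerSecond: 1000, maxWaitSeconds: 10, avgRequestKb: 50 }));
// { maxSize: 10000, memoryMb: 500 }
```

The configured `max_memory_mb: 512` leaves a small margin above the computed 500 MB.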
Pitfall 3: Forgetting to Handle Backpressure in Client Code
Problem: The router correctly rejects requests when overloaded, but client code doesn’t handle 503 responses gracefully, leading to cascading failures.
Solution: Implement retry logic with exponential backoff in client code.
// Client retry logic with backpressure handling
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumption: fetch reports network failures as TypeError; adjust for your HTTP client
const isRetryableError = (error) => error instanceof TypeError;

async function callApiWithRetry(url, options = {}, attempt = 1) {
  const maxAttempts = 5;
  const baseDelay = 1000;
  try {
    const response = await fetch(url, options);
    if (response.status === 503) {
      // Server is overloaded - retry with backoff
      if (attempt < maxAttempts) {
        const delay = baseDelay * Math.pow(2, attempt - 1);
        console.log(`Service overloaded, retrying in ${delay}ms`);
        await sleep(delay);
        return callApiWithRetry(url, options, attempt + 1);
      }
      throw new Error('Service overloaded after max retries');
    }
    return response;
  } catch (error) {
    if (attempt < maxAttempts && isRetryableError(error)) {
      await sleep(baseDelay * Math.pow(2, attempt - 1));
      return callApiWithRetry(url, options, attempt + 1);
    }
    throw error;
  }
}
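When many clients back off on the same schedule, their retries can arrive in synchronized waves and overload the router again. A common mitigation is to randomize the delay ("full jitter") and cap it. The sketch below shows only the delay calculation; `backoffDelayMs` is a hypothetical helper, and the `random` parameter exists so the function can be tested deterministically.

```javascript
// Hypothetical helper: exponential backoff with full jitter and an upper bound.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000, random = Math.random) {
  // Exponential growth, limited to capMs.
  const exponential = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  // Pick a uniformly random delay in [0, exponential).
  return Math.floor(random() * exponential);
}

// With a fixed "random" value of 0.5 the result is half the exponential delay.
console.log(backoffDelayMs(1, 1000, 30000, () => 0.5)); // 500
console.log(backoffDelayMs(3, 1000, 30000, () => 0.5)); // 2000
```

In retry logic like the function above, this value would replace the fixed `baseDelay * Math.pow(2, attempt - 1)` delay.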
Pitfall 4: Not Monitoring Memory in Production
Problem: Memory issues only become apparent when the router crashes, causing unexpected outages.
Solution: Implement comprehensive memory monitoring and alerting.
# alerting configuration
monitoring:
  metrics:
    - name: router.memory.used_mb
      type: gauge
      alert_threshold: 1536
      alert_severity: warning
      action: notify
    - name: router.memory.used_mb
      type: gauge
      alert_threshold: 1920
      alert_severity: critical
      action: page_oncall
    - name: router.queue.size
      type: gauge
      alert_threshold: 8000
      alert_severity: warning
    - name: router.gc.pause_ms
      type: histogram
      alert_threshold: 500
      alert_severity: warning
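The memory thresholds above (1536 and 1920) correspond to 75% and 93.75% of the 2048 MB limit shown in the Symptoms log. Deriving them from the limit, rather than hard-coding them, keeps the alerts meaningful if the limit is later changed. The following is an illustrative sketch; `thresholdsFor` and `classifyMemory` are hypothetical helpers, not part of any monitoring API.

```javascript
// Hypothetical helpers: derive alert thresholds from the memory limit and classify a reading.
function thresholdsFor(limitMb) {
  return { warningMb: limitMb * 0.75, criticalMb: limitMb * 0.9375 };
}

function classifyMemory(usedMb, limitMb) {
  const { warningMb, criticalMb } = thresholdsFor(limitMb);
  if (usedMb >= criticalMb) return 'critical';
  if (usedMb >= warningMb) return 'warning';
  return 'ok';
}

console.log(thresholdsFor(2048));        // { warningMb: 1536, criticalMb: 1920 }
console.log(classifyMemory(1700, 2048)); // warning
```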
Pitfall 5: Ignoring Fragmentation in Long-Running Processes
Problem: Even with bounded memory, heap fragmentation can cause apparent memory exhaustion before the limit is reached.
Solution: Schedule periodic restarts or implement memory defragmentation.
# openclaw.yaml - restart policy
router:
  process:
    restart_on_memory_threshold_mb: 1792
    restart_grace_period_seconds: 30
    min_uptime_seconds: 3600
    # Allows graceful restart before OOM kill
// Or implement manual restart trigger
async function checkMemoryHealth() {
  const memoryUsage = process.memoryUsage();
  const heapUsedMB = memoryUsage.heapUsed / 1024 / 1024;
  const heapTotalMB = memoryUsage.heapTotal / 1024 / 1024;
  // Rough proxy for fragmentation: heap reserved relative to heap in use.
  // Math.max guards against dividing by a near-zero value.
  const fragmentationRatio = heapTotalMB / Math.max(heapUsedMB, 1);
  // Trigger graceful restart if fragmented
  if (fragmentationRatio > 2.0 && heapUsedMB < 512) {
    console.log('High fragmentation detected, initiating graceful restart');
    await initiateGracefulRestart();
  }
}
Pitfall 6: Missing Request Size Limits
Problem: Individual requests with large payloads can consume significant memory and trigger OOM.
Solution: Always enforce maximum request body size limits.
# openclaw.yaml - request size limits
router:
  limits:
    max_request_body_mb: 10
    max_request_headers_kb: 32
    max_concurrent_requests: 1000
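// Also enforce in request handler
function enforceBodyLimit(req, res, maxBytes) {
  let bytesReceived = 0;
  let rejected = false;
  req.on('data', (chunk) => {
    if (rejected) return; // Ignore chunks that arrive after the limit was hit
    bytesReceived += chunk.length;
    if (bytesReceived > maxBytes) {
      rejected = true;
      if (res.headersSent) {
        req.destroy();
        return;
      }
      // Respond before destroying the request; destroying first closes the socket
      res.writeHead(413, { 'Connection': 'close' });
      res.end('Request body too large', () => req.destroy());
    }
  });
}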
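Counting bytes as they arrive catches every oversized body, but when the client declares a `Content-Length`, the request can be rejected before any of the body is read. A minimal sketch of that pre-check follows, assuming Node.js-style request headers (lower-cased names, string values); `declaredBodyTooLarge` is a hypothetical helper.

```javascript
// Hypothetical helper: true when the declared Content-Length already exceeds the limit.
function declaredBodyTooLarge(headers, maxBytes) {
  const raw = headers['content-length'];
  // Chunked or unknown length: fall back to counting bytes as they arrive.
  if (raw === undefined) return false;
  const declared = Number.parseInt(raw, 10);
  return Number.isFinite(declared) && declared > maxBytes;
}

console.log(declaredBodyTooLarge({ 'content-length': '2048' }, 1024)); // true
console.log(declaredBodyTooLarge({}, 1024));                           // false
```

The streaming byte count is still needed alongside this check, because the header may be absent for chunked uploads.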
6. Related Errors
The following errors are commonly related to clw-router-oom:
| Error Code | Relationship | Description |
|---|---|---|
| clw-router-disconnect | Downstream effect | Router disconnects clients during OOM recovery |
| clw-worker-oom | Similar pattern | Worker process memory exhaustion (same root causes) |
| clw-queue-overflow | Precursor condition | Queue fills up before OOM occurs |
| clw-socket-econnreset | Downstream effect | Connection resets due to router instability from OOM |
| clw-process-crash | Downstream effect | Router process terminates due to OOM kill |
| clw-gc-pause-exceeded | Related symptom | Long GC pauses indicate memory pressure |
| clw-heap-growing | Precursor condition | Continuous heap growth precedes OOM |
Diagnostic Flow
When encountering clw-router-oom, follow this diagnostic sequence:
- Check for clw-queue-overflow - Was the queue backing up before OOM?
- Check for clw-gc-pause-exceeded - Was GC struggling with memory pressure?
- Review memory metrics - Identify the peak usage pattern
- Analyze heap dumps - Find memory leak sources
- Review queue configuration - Verify bounded limits
- Check cache configuration - Ensure memory-bounded cache
Escalation Path
clw-router-oom
├── Check router logs for precursor warnings
│ └── Look for: clw-gc-pause-exceeded, clw-queue-overflow
├── Verify queue backpressure is configured
│ └── If not, apply Fix 1
├── Profile for memory leaks
│ ├── If leaks found, apply Fix 2
│ └── If no leaks, check workload vs. allocated memory
├── Verify cache memory bounds
│ └── If unbounded, apply Fix 3
├── Consider increasing memory limit
│ └── Apply Fix 5 if workload legitimately requires more
└── Implement monitoring
└── Apply Pitfall 4 monitoring configuration