Fix clw-router-oom: Router Process Memory Exhausted

1. Symptoms

The clw-router-oom error manifests when the OpenClaw router process exhausts its allocated memory budget. This error is critical and typically causes the router to terminate or enter a degraded state.

Observable Symptoms

When this error occurs, you will observe the following symptoms in your system:

Process Termination or Crash:

[OpenClaw Router] FATAL: clw-router-oom: Router heap exhausted
[OpenClaw Router]   Request count: 15,847
[OpenClaw Router]   Memory used: 2048 MB / 2048 MB
[OpenClaw Router]   Duration: 00:23:45
[OpenClaw Router]   Exiting with code 137 (SIGKILL)

Degraded Performance Leading to OOM:

[OpenClaw Router] WARN: Memory pressure detected
[OpenClaw Router]   Allocated: 1536 MB
[OpenClaw Router]   Available: 512 MB
[OpenClaw Router]   GC threshold: 70%
[OpenClaw Router] WARN: Request queue backlog: 2,500 pending requests
[OpenClaw Router] WARN: Slow GC detected (pause: 450ms)

Client-Side Impact:

Requests hang indefinitely with no response
Connection timeouts increase
WebSocket connections drop unexpectedly
API clients receive 503 Service Unavailable responses
Load balancer marks router as unhealthy

Memory Growth Pattern:

[OpenClaw Router] Memory usage trend:
[OpenClaw Router]   00:00:00 - 256 MB
[OpenClaw Router]   00:05:00 - 512 MB
[OpenClaw Router]   00:10:00 - 1024 MB
[OpenClaw Router]   00:15:00 - 1536 MB
[OpenClaw Router]   00:20:00 - 2048 MB (OOM)
[OpenClaw Router]   Growth rate: ~90 MB/minute

The error may also appear in system logs as a process being OOM-killed:

kernel: [12345.678901] oom-killer: gpid=12345, comm=clw-router, 
         oom_score_adj=500, vm_rss=2097152 kB, 
         killed process as a result of limit set

2. Root Cause

The clw-router-oom error occurs when memory consumption exceeds the router’s allocated heap size. Several underlying issues can cause this.

Primary Root Causes

1. Unbounded Request Queue: The router maintains an internal queue for incoming requests. When request processing slows down (due to backend latency, network issues, or downstream service failures), requests accumulate faster than they are processed. The queue grows without limit until memory is exhausted.

// Vulnerable queue implementation (causes OOM)
class RouterQueue {
    constructor() {
        this.pending = []; // No max size enforcement
    }
    
    enqueue(request) {
        this.pending.push(request); // Always succeeds, grows unbounded
    }
}
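
To make this failure mode concrete, the sketch below estimates how long an unbounded queue takes to fill a memory budget once requests arrive faster than they are processed. The function name and every input are illustrative assumptions, not OpenClaw APIs; the per-request footprint in particular has to be measured for your own workload.

```javascript
// Hypothetical helper: estimates seconds until an unbounded queue exhausts
// a memory budget. All parameters are assumptions for illustration.
function secondsUntilQueueOom({ arrivalPerSec, processedPerSec, bytesPerRequest, budgetBytes }) {
    const backlogGrowthPerSec = arrivalPerSec - processedPerSec;
    if (backlogGrowthPerSec <= 0) {
        return Infinity; // The queue drains at least as fast as it fills
    }
    return budgetBytes / (backlogGrowthPerSec * bytesPerRequest);
}

// Example: 1,000 req/s arriving, 900 req/s processed, ~50 KB held per
// queued request, 2048 MB budget
const seconds = secondsUntilQueueOom({
    arrivalPerSec: 1000,
    processedPerSec: 900,
    bytesPerRequest: 50 * 1024,
    budgetBytes: 2048 * 1024 * 1024
});
console.log(`Estimated time to exhaustion: ${(seconds / 60).toFixed(1)} minutes`);
```

With these example numbers, a shortfall of only 100 requests per second fills the budget in roughly seven minutes, which is why a small but sustained slowdown is enough to trigger the error.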

2. Memory Leaks in Request Handlers: Improper handling of request context, response objects, or event listeners can cause memory to accumulate. Common leak patterns include:

Response streams that are never released
Event listeners attached but never removed
Circular references in closure-based handlers
Accumulated buffer data that is never flushed

// Memory leak pattern example
router.on('request', (req, res) => {
    const buffer = [];
    req.on('data', (chunk) => buffer.push(chunk));
    req.on('end', () => processRequest(buffer));
    
    // BUG: req object and buffer are never cleaned up
    // even after response is sent
});

3. Large Response Caching Without Eviction: If the router caches responses or partial data, an unbounded cache will eventually consume all available memory. This is especially problematic with large payloads or high request diversity.

# openclaw.yaml - problematic cache configuration
router:
  cache:
    enabled: true
    max_entries: 1000000  # No size limit, entries only
    ttl: 3600
    # Missing: max_memory_mb or max_entry_size

4. Buffering Upstream Responses: When the router proxies responses from upstream services, buffering an entire response in memory before sending it to the client causes memory spikes, especially with large files or streams.

5. Insufficient Memory Configuration: The router may be configured with too little memory for its workload. This can happen when:

Workload has grown since initial deployment
Memory limit was set too conservatively
Other processes on the host consume shared memory

Secondary Contributing Factors

6. Garbage Collection Pressure: High allocation rates cause frequent GC cycles, which can temporarily spike memory usage and compound memory pressure.

7. Fragmentation: Long-running router processes can experience heap fragmentation, which reduces the usable memory even when total allocated memory appears reasonable.

8. Concurrency Overload: Each in-flight request holds its own buffers and context, so an excessive number of concurrent requests can allocate memory faster than it is released and exceed the memory limit.

3. Step-by-Step Fix

Fix 1: Implement Bounded Request Queue

Add a maximum queue size with configurable backpressure behavior.

Before:

# openclaw.yaml - unbounded queue (vulnerable)
router:
  queue:
    type: fifo
    # No size limits configured

After:

# openclaw.yaml - bounded queue with backpressure
router:
  queue:
    type: fifo
    max_size: 10000
    max_memory_mb: 512
    overflow_action: reject
    overflow_status: 503
    overflow_message: "Service overloaded, try again later"
    drain_timeout_ms: 30000

Before:

// queue-handler.js - unbounded implementation
function handleRequest(req, res) {
    requestQueue.push({ req, res, timestamp: Date.now() });
    processQueue();
}

function processQueue() {
    while (requestQueue.length > 0) {
        const item = requestQueue.shift();
        // Process item...
    }
}

After:

// queue-handler.js - bounded implementation with backpressure
const MAX_QUEUE_SIZE = 10000;
const requestQueue = [];

function handleRequest(req, res) {
    if (requestQueue.length >= MAX_QUEUE_SIZE) {
        res.writeHead(503, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({
            error: 'clw-router-oom',
            message: 'Service overloaded, try again later'
        }));
        metrics.increment('router.queue.rejected');
        return;
    }
    
    requestQueue.push({ req, res, timestamp: Date.now() });
    processQueue();
}

function processQueue() {
    // Yield to event loop periodically to prevent blocking
    setImmediate(() => {
        const item = requestQueue.shift();
        if (item) {
            processItem(item);
        }
    });
}
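
The handler above enforces only the entry count, while the configuration also sets max_memory_mb. The following sketch shows one way a queue could enforce both limits. The class name, option names, and the caller-supplied size estimate are assumptions for illustration, not OpenClaw's implementation.

```javascript
// Sketch of a queue bounded by both entry count and estimated bytes.
// Class and option names are illustrative assumptions.
class BoundedQueue {
    constructor({ maxSize = 10000, maxBytes = 512 * 1024 * 1024 } = {}) {
        this.maxSize = maxSize;
        this.maxBytes = maxBytes;
        this.items = [];
        this.head = 0;   // Index of the next item to dequeue
        this.bytes = 0;  // Estimated bytes currently queued
    }

    get size() {
        return this.items.length - this.head;
    }

    // Returns false instead of growing past either limit, so the caller
    // can reject the request (for example with a 503).
    enqueue(item, estimatedBytes) {
        if (this.size >= this.maxSize || this.bytes + estimatedBytes > this.maxBytes) {
            return false;
        }
        this.items.push({ item, estimatedBytes });
        this.bytes += estimatedBytes;
        return true;
    }

    dequeue() {
        if (this.size === 0) return undefined;
        const entry = this.items[this.head];
        this.items[this.head] = undefined; // Drop the reference so it can be collected
        this.head++;
        this.bytes -= entry.estimatedBytes;
        // Compact occasionally so the backing array does not grow forever
        if (this.head > 1024 && this.head * 2 > this.items.length) {
            this.items = this.items.slice(this.head);
            this.head = 0;
        }
        return entry.item;
    }
}

const queue = new BoundedQueue({ maxSize: 2, maxBytes: 100 });
console.log(queue.enqueue('a', 40)); // true
console.log(queue.enqueue('b', 70)); // false: would exceed maxBytes
console.log(queue.enqueue('c', 40)); // true
console.log(queue.enqueue('d', 10)); // false: maxSize reached
console.log(queue.dequeue());        // a
```

Using a moving head index avoids the cost of Array.prototype.shift() on a large array, which matters when the queue is allowed to hold thousands of entries.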

Fix 2: Prevent Memory Leaks in Request Handlers

Implement proper cleanup and resource management.

Before:

// request-handler.js - leaking handler
router.on('request', (req, res) => {
    const buffers = [];
    req.on('data', (chunk) => buffers.push(chunk));
    req.on('end', () => {
        const body = Buffer.concat(buffers);
        processRequest(req, res, body);
    });
    
    // BUG: buffers array grows indefinitely if stream doesn't end
    // BUG: No cleanup after response is complete
});

After:

// request-handler.js - leak-free handler
router.on('request', (req, res) => {
    const buffers = [];
    let streamEnded = false;
    
    function cleanup() {
        // Remove listeners and clear buffers
        req.removeAllListeners('data');
        req.removeAllListeners('end');
        req.removeAllListeners('error');
        buffers.length = 0; // Release buffer memory
    }
    
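    let totalSize = 0;
    
    req.on('data', (chunk) => {
        buffers.push(chunk);
        // Guard against unbounded buffer growth using a running total
        // (re-summing every buffer on each chunk is quadratic)
        totalSize += chunk.length;
        if (totalSize > 50 * 1024 * 1024) { // 50MB limit
            cleanup();
            res.writeHead(413, { 'Content-Type': 'application/json', 'Connection': 'close' });
            // Destroy the request only after the response has been flushed;
            // destroying first would close the socket before the 413 is sent
            res.end(JSON.stringify({ error: 'Request body too large' }), () => req.destroy());
        }
    });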
    
    req.on('end', () => {
        streamEnded = true;
        const body = Buffer.concat(buffers);
        processRequest(req, res, body).finally(cleanup);
    });
    
    req.on('error', (err) => {
        cleanup();
        if (!res.headersSent) {
            res.writeHead(500, { 'Content-Type': 'application/json' });
            res.end(JSON.stringify({ error: 'Request processing failed' }));
        } else {
            // Headers already sent: end without appending a JSON error to the body
            res.end();
        }
    });
    
    // Timeout protection
    const timeout = setTimeout(() => {
        if (!streamEnded) {
            cleanup();
            req.destroy();
            if (!res.headersSent) {
                res.writeHead(408, { 'Content-Type': 'application/json' });
                res.end(JSON.stringify({ error: 'Request timeout' }));
            }
        }
    }, 30000);
    
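    // 'close' fires both when the response completes and when the connection
    // drops early, so the timer is always cleared
    res.on('close', () => clearTimeout(timeout));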
});
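
As an alternative to managing listeners by hand, the body can be collected with async iteration, which attaches and removes its own listeners. The helper below is a minimal sketch under that assumption: readBodyWithLimit is a hypothetical name, and in-memory streams stand in for real requests. Note that leaving the loop early destroys the stream; for an HTTP request that also closes the connection, so the client may see a reset rather than a 413 response.

```javascript
const { Readable } = require('node:stream');

// Hypothetical helper: collects a body from any Readable (including
// http.IncomingMessage) and rejects once it exceeds maxBytes.
async function readBodyWithLimit(stream, maxBytes) {
    const chunks = [];
    let totalBytes = 0;
    for await (const chunk of stream) {
        totalBytes += chunk.length;
        if (totalBytes > maxBytes) {
            // Throwing out of for-await destroys the stream, which stops
            // further reads; the local chunks array then becomes garbage.
            const err = new Error('Request body too large');
            err.statusCode = 413;
            throw err;
        }
        chunks.push(chunk);
    }
    return Buffer.concat(chunks);
}

// Demonstration with in-memory streams standing in for requests
async function demo() {
    const ok = await readBodyWithLimit(Readable.from([Buffer.from('hello')]), 1024);
    console.log(ok.toString()); // hello

    const oversized = Readable.from([Buffer.alloc(600), Buffer.alloc(600)]);
    try {
        await readBodyWithLimit(oversized, 1024);
    } catch (err) {
        console.log(err.statusCode); // 413
    }
}
demo();
```

Because the function returns a promise, the caller can combine it with a timeout or an AbortSignal without tracking separate flags such as streamEnded.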

Fix 3: Configure Memory-Bounded Cache

Add memory limits to prevent unbounded cache growth.

Before:

# openclaw.yaml - unbounded cache
router:
  cache:
    enabled: true
    max_entries: 1000000
    ttl: 3600

After:

# openclaw.yaml - memory-bounded cache
router:
  cache:
    enabled: true
    max_entries: 100000
    max_memory_mb: 256
    max_entry_size_kb: 1024
    ttl: 3600
    eviction_policy: lru
    preload:
      enabled: false

Implementation:

// bounded-cache.js - memory-aware cache
class BoundedCache {
    constructor(options = {}) {
        this.maxMemoryMB = options.max_memory_mb || 256;
        this.maxMemoryBytes = this.maxMemoryMB * 1024 * 1024;
        this.maxEntrySizeKB = options.max_entry_size_kb || 1024;
        this.maxEntrySizeBytes = this.maxEntrySizeKB * 1024;
        this.currentMemoryBytes = 0;
        this.cache = new Map();
        this.accessOrder = [];
    }
    
    get(key) {
        const entry = this.cache.get(key);
        if (entry) {
            this.updateAccessOrder(key);
            entry.hits++;
            return entry.value;
        }
        return null;
    }
    
    set(key, value) {
        const estimatedSize = this.estimateSize(value);
        
        // Reject entries that exceed the per-entry limit or could never fit
        if (estimatedSize > this.maxEntrySizeBytes || estimatedSize > this.maxMemoryBytes) {
            throw new Error(`Cache entry exceeds maximum size: ${estimatedSize} bytes`);
        }
        
        // Remove the existing entry first so its size is not counted twice
        if (this.cache.has(key)) {
            const existing = this.cache.get(key);
            this.currentMemoryBytes -= existing.size;
            this.cache.delete(key);
        }
        
        // Evict entries if necessary to make room; the size check on the
        // cache prevents an endless loop once nothing is left to evict
        while (this.cache.size > 0 &&
               this.currentMemoryBytes + estimatedSize > this.maxMemoryBytes) {
            this.evictLRU();
        }
        
        this.cache.set(key, {
            value,
            size: estimatedSize,
            createdAt: Date.now(),
            hits: 0
        });
        this.currentMemoryBytes += estimatedSize;
        this.updateAccessOrder(key);
    }
    
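    estimateSize(value) {
        if (Buffer.isBuffer(value)) {
            return value.length;
        }
        if (typeof value === 'string') {
            return Buffer.byteLength(value);
        }
        // Approximate other values by their JSON encoding, measured in bytes
        // rather than UTF-16 code units; JSON.stringify returns undefined
        // for values it cannot encode (such as undefined itself)
        return Buffer.byteLength(JSON.stringify(value) ?? '');
    }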
    
    evictLRU() {
        const oldestKey = this.accessOrder.shift();
        if (oldestKey) {
            const entry = this.cache.get(oldestKey);
            if (entry) {
                this.currentMemoryBytes -= entry.size;
                this.cache.delete(oldestKey);
            }
        }
    }
    
    updateAccessOrder(key) {
        const index = this.accessOrder.indexOf(key);
        if (index > -1) {
            this.accessOrder.splice(index, 1);
        }
        this.accessOrder.push(key);
    }
}
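
The accessOrder array above needs a linear scan on every read, which becomes costly with tens of thousands of entries. A JavaScript Map iterates keys in insertion order, so re-inserting a key on access gives the same LRU behavior in constant time. The sketch below illustrates the idea; MapLruCache is an illustrative name and entry sizes are supplied by the caller, so this is not a drop-in replacement for the class above.

```javascript
// Sketch: LRU ordering via Map insertion order (the oldest key iterates first).
// MapLruCache is an illustrative name; sizes are supplied by the caller.
class MapLruCache {
    constructor(maxBytes) {
        this.maxBytes = maxBytes;
        this.bytes = 0;
        this.entries = new Map(); // key -> { value, size }
    }

    get(key) {
        const entry = this.entries.get(key);
        if (!entry) return null;
        // Re-insert to mark the key as most recently used
        this.entries.delete(key);
        this.entries.set(key, entry);
        return entry.value;
    }

    set(key, value, size) {
        if (size > this.maxBytes) return false; // Can never fit
        const existing = this.entries.get(key);
        if (existing) {
            this.bytes -= existing.size;
            this.entries.delete(key);
        }
        // Evict least recently used entries until the new one fits
        while (this.bytes + size > this.maxBytes) {
            const oldestKey = this.entries.keys().next().value;
            this.bytes -= this.entries.get(oldestKey).size;
            this.entries.delete(oldestKey);
        }
        this.entries.set(key, { value, size });
        this.bytes += size;
        return true;
    }
}

const cache = new MapLruCache(100);
cache.set('a', 'A', 40);
cache.set('b', 'B', 40);
cache.get('a');          // 'a' is now the most recently used key
cache.set('c', 'C', 40); // Evicts 'b', the least recently used key
console.log([...cache.entries.keys()]); // [ 'a', 'c' ]
```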

Fix 4: Stream Responses Instead of Buffering

Implement streaming for large responses to avoid memory spikes.

Before:

// proxy-handler.js - buffering large responses
async function proxyRequest(req, res, upstreamUrl) {
    const response = await fetch(upstreamUrl, {
        method: req.method,
        body: req.body
    });
    
    // BUG: Entire response body is buffered in memory
    const body = await response.arrayBuffer();
    
    res.writeHead(response.status, response.headers);
    res.end(Buffer.from(body)); // Sends entire buffer at once
}

After:

// proxy-handler.js - streaming responses
async function proxyRequest(req, res, upstreamUrl) {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 60000);
    
    try {
        const response = await fetch(upstreamUrl, {
            method: req.method,
            body: req.body,
            signal: controller.signal
        });
        
        clearTimeout(timeout);
        
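        // fetch() transparently decompresses the body, so drop the headers that
        // describe the original encoding and length before forwarding
        const headers = Object.fromEntries(response.headers);
        delete headers['content-encoding'];
        delete headers['content-length'];
        res.writeHead(response.status, headers);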
        
        // Stream response body directly to client
        const reader = response.body.getReader();
        
        while (true) {
            const { done, value } = await reader.read();
            if (done) break;
            
            // Write chunks as they arrive
            if (!res.write(value)) {
                // Handle backpressure - wait for drain
                await new Promise(resolve => res.once('drain', resolve));
            }
        }
        
        res.end();
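    } catch (error) {
        clearTimeout(timeout);
        if (!res.headersSent) {
            res.writeHead(502, { 'Content-Type': 'application/json' });
            res.end(JSON.stringify({ error: 'Upstream request failed' }));
        } else {
            // Body already partly sent: abort so the client sees a truncated
            // response rather than a JSON error appended to the payload
            res.destroy(error);
        }
    }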
}
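
The manual read loop can also be expressed with Node's stream utilities. The sketch below assumes a Node.js version that provides Readable.fromWeb and a global ReadableStream (both are available in recent releases); streamBody is a hypothetical helper name, and the in-memory stream and sink stand in for response.body and res.

```javascript
const { Readable, Writable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Sketch: forward a fetch() response body to a Node writable (such as `res`)
// without buffering it. pipeline() applies backpressure, ends the destination
// when the source ends, and destroys both streams if either side fails.
async function streamBody(webBody, destination) {
    await pipeline(Readable.fromWeb(webBody), destination);
}

// Stand-ins for response.body and res, for demonstration only
const webBody = new ReadableStream({
    start(controller) {
        controller.enqueue(new TextEncoder().encode('chunk-1 '));
        controller.enqueue(new TextEncoder().encode('chunk-2'));
        controller.close();
    }
});
const received = [];
const sink = new Writable({
    write(chunk, _encoding, callback) {
        received.push(chunk);
        callback();
    }
});

streamBody(webBody, sink).then(() => {
    console.log(Buffer.concat(received).toString()); // chunk-1 chunk-2
});
```

Compared with the explicit loop, this form also covers the case where the client disconnects mid-transfer: the pipeline rejects instead of waiting on a drain event that never arrives.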

Fix 5: Increase Router Memory Limit

Adjust the memory allocation when the workload legitimately requires more resources.

Before:

# Starting router with insufficient memory
clw-router --memory-mb 512

After:

# Starting router with appropriate memory allocation
clw-router --memory-mb 2048 --memory-swap-mb 256

# Alternative: Via environment variable
export CLW_ROUTER_HEAP_SIZE=2048
export CLW_ROUTER_HEAP_NEWSIZE=512
clw-router

Configuration update:

# openclaw.yaml - memory configuration
router:
  process:
    max_heap_mb: 2048
    max_old_space_mb: 1536
    max_new_space_mb: 256
    gc_interval_ms: 1000
    force_gc_threshold_mb: 256

4. Verification

After applying fixes, verify that the memory issue is resolved by checking the following:

1. Memory Usage Stability

# Monitor router memory over time
watch -n 5 'curl -s http://localhost:3000/metrics | jq .router.memory'

# Expected: Memory should stabilize below threshold
# Example output showing stable memory:
{
  "allocated_mb": 512,
  "used_mb": 384,
  "available_mb": 1664,
  "gc_count": 12,
  "gc_pause_ms": 45
}

2. Queue Backpressure Functionality

# Test queue rejection under load
ab -n 20000 -c 1000 http://localhost:3000/api/endpoint

# Check rejection metrics
curl -s http://localhost:3000/metrics | jq .router.queue
# Example output (illustrative values; enqueued + rejected equals the
# number of requests sent, and size equals enqueued - processed):
{
  "size": 10000,
  "max_size": 10000,
  "enqueued": 11579,
  "rejected": 8421,
  "processed": 1579,
  "overflow_active": true
}

3. Memory Leak Detection

# Run heap snapshot comparison after sustained load
# First, trigger garbage collection
curl -X POST http://localhost:3000/admin/gc

# Take heap snapshot
curl -o heap1.heapsnapshot http://localhost:3000/debug/heap-snapshot

# Run load test for 5 minutes
# ...

# Trigger another GC and take snapshot
curl -X POST http://localhost:3000/admin/gc
curl -o heap2.heapsnapshot http://localhost:3000/debug/heap-snapshot

# Compare heap sizes - should not grow significantly
# Use Chrome DevTools to compare the snapshots

4. Cache Memory Bounds

# Verify cache respects memory limits
curl -s http://localhost:3000/metrics | jq .router.cache
# Expected output:
{
  "entries": 50000,
  "memory_mb": 256,
  "max_memory_mb": 256,
  "evictions": 1500,
  "hit_rate": 0.78
}

5. End-to-End Load Test

# Run sustained load test to verify stability
wrk -t4 -c100 -d300s -s post.lua http://localhost:3000/api/endpoint

# Monitor throughout the test
# Memory should remain stable, not growing linearly
# Latency should remain consistent (not increasing)
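
To turn "not growing linearly" into a number, one option is to fit a straight line through periodic memory samples and look at the slope. The sketch below does this with an ordinary least-squares fit; the function name and the sample shape ({ minute, usedMb }) are assumptions for illustration, and the first data set reuses the growth trend shown in the Symptoms section.

```javascript
// Sketch: least-squares slope of memory samples, in MB per minute.
// The sample shape ({ minute, usedMb }) is an assumption for illustration.
function growthRateMbPerMinute(samples) {
    const n = samples.length;
    if (n < 2) return 0;
    const meanT = samples.reduce((sum, p) => sum + p.minute, 0) / n;
    const meanM = samples.reduce((sum, p) => sum + p.usedMb, 0) / n;
    let num = 0;
    let den = 0;
    for (const p of samples) {
        num += (p.minute - meanT) * (p.usedMb - meanM);
        den += (p.minute - meanT) ** 2;
    }
    return den === 0 ? 0 : num / den;
}

// Leaking router: the trend from the Symptoms section
const leaking = [
    { minute: 0, usedMb: 256 },
    { minute: 5, usedMb: 512 },
    { minute: 10, usedMb: 1024 },
    { minute: 15, usedMb: 1536 },
    { minute: 20, usedMb: 2048 }
];
// Healthy router: memory plateaus
const stable = [0, 5, 10, 15, 20].map((minute) => ({ minute, usedMb: 384 }));

console.log(growthRateMbPerMinute(leaking).toFixed(1)); // 92.2
console.log(growthRateMbPerMinute(stable).toFixed(1));  // 0.0
```

A slope that stays near zero over the length of the load test suggests the fixes are holding; a persistently positive slope is worth investigating with heap snapshots.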

6. Verify Process Survival

# Run router under memory constraints using a cgroup (cgroup v1 tools shown;
# on cgroup v2 the equivalent setting is memory.max)
sudo cgcreate -g memory:/clw-router-test
sudo cgset -r memory.limit_in_bytes=1073741824 /clw-router-test  # 1GB limit
sudo cgexec -g memory:/clw-router-test clw-router --memory-mb 768

# Run load test - process should handle pressure, not crash
# Check that queue backpressure activates before OOM kill

5. Common Pitfalls

Pitfall 1: Increasing Memory Without Addressing Root Cause

Problem: Simply allocating more memory masks the underlying leak or architectural issue. Memory will eventually be exhausted again, often more catastrophically.

Solution: Always diagnose and fix the root cause before increasing memory limits. Use memory profiling to identify leaks.

# DON'T just do this without investigation:
clw-router --memory-mb 8192

# DO profile first:
node --inspect-brk=9229 clw-router
# Then use Chrome DevTools to take heap snapshots and identify leaks

Pitfall 2: Setting Queue Size Too Small

Problem: An overly restrictive queue size causes premature request rejection, resulting in poor user experience even when the system could handle the load.

Solution: Calculate queue size based on expected peak load, request size, and acceptable memory usage.

# WRONG: Too restrictive
router:
  queue:
    max_size: 100
    # Will reject requests during normal peak traffic

# BETTER: Calculate based on workload
router:
  queue:
    max_size: 10000
    max_memory_mb: 512
    # Based on: 1000 req/s * 10s acceptable wait * ~50KB avg = 500MB
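
The sizing arithmetic in the comment above can be written as a small function, which makes it easy to rerun when traffic assumptions change. The function and parameter names are illustrative, and the inputs are the example figures from the comment rather than measured values.

```javascript
// Sketch of the queue sizing arithmetic; names are illustrative.
function sizeQueue({ peakRequestsPerSec, maxWaitSec, avgRequestKb }) {
    // Entries needed to absorb peak traffic for the acceptable wait time
    const maxSize = Math.ceil(peakRequestsPerSec * maxWaitSec);
    // Memory those entries occupy, rounded up to whole megabytes
    const maxMemoryMb = Math.ceil((maxSize * avgRequestKb) / 1024);
    return { maxSize, maxMemoryMb };
}

console.log(sizeQueue({ peakRequestsPerSec: 1000, maxWaitSec: 10, avgRequestKb: 50 }));
// { maxSize: 10000, maxMemoryMb: 489 }
```

The result of about 489 MB fits within the 512 MB configured above, leaving a small margin for per-entry overhead.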

Pitfall 3: Forgetting to Handle Backpressure in Client Code

Problem: The router correctly rejects requests when overloaded, but client code doesn’t handle 503 responses gracefully, leading to cascading failures.

Solution: Implement retry logic with exponential backoff in client code.

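// Client retry logic with backpressure handling
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// isRetryableError(error) is application-defined; it should return true for
// transient failures such as network timeouts or connection resets.
async function callApiWithRetry(url, options = {}, attempt = 1) {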
    const maxAttempts = 5;
    const baseDelay = 1000;
    
    try {
        const response = await fetch(url, options);
        
        if (response.status === 503) {
            // Server is overloaded - retry with backoff
            if (attempt < maxAttempts) {
                const delay = baseDelay * Math.pow(2, attempt - 1);
                console.log(`Service overloaded, retrying in ${delay}ms`);
                await sleep(delay);
                return callApiWithRetry(url, options, attempt + 1);
            }
            throw new Error('Service overloaded after max retries');
        }
        
        return response;
    } catch (error) {
        if (attempt < maxAttempts && isRetryableError(error)) {
            await sleep(baseDelay * Math.pow(2, attempt - 1));
            return callApiWithRetry(url, options, attempt + 1);
        }
        throw error;
    }
}
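
When many clients receive a 503 at the same moment, plain exponential backoff makes them all retry at the same moment too. A common refinement is to cap the delay and add random jitter. The sketch below shows one such variant (often called "full jitter"); the function name, defaults, and injectable random source are assumptions for illustration.

```javascript
// Sketch: capped exponential backoff with full jitter.
// `random` is injectable so the result can be made deterministic in tests.
function backoffDelayMs(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
    // Exponential ceiling: baseMs, 2x, 4x, ... limited to capMs
    const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
    // Pick a uniformly random delay below the ceiling
    return Math.floor(random() * ceiling);
}

// With a fixed random value of 0.5 the delays are half of each ceiling
const fixed = { random: () => 0.5 };
console.log([1, 2, 3, 4, 5].map((attempt) => backoffDelayMs(attempt, fixed)));
// [ 500, 1000, 2000, 4000, 8000 ]
```

In callApiWithRetry, the computed delay would replace the baseDelay * Math.pow(2, attempt - 1) expression.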

Pitfall 4: Not Monitoring Memory in Production

Problem: Memory issues only become apparent when the router crashes, causing unexpected outages.

Solution: Implement comprehensive memory monitoring and alerting.

# alerting configuration
monitoring:
  metrics:
    - name: router.memory.used_mb
      type: gauge
      alert_threshold: 1536
      alert_severity: warning
      action: notify
    - name: router.memory.used_mb
      type: gauge
      alert_threshold: 1920
      alert_severity: critical
      action: page_oncall
    - name: router.queue.size
      type: gauge
      alert_threshold: 8000
      alert_severity: warning
    - name: router.gc.pause_ms
      type: histogram
      alert_threshold: 500
      alert_severity: warning
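
The two memory thresholds above amount to a simple classification of each gauge reading. The sketch below shows that logic in isolation, using the same 1536 MB and 1920 MB values as defaults; the function name is hypothetical, and reading the resident set size via process.memoryUsage() assumes the check runs inside a Node.js process.

```javascript
// Sketch: evaluate a memory reading against warning and critical thresholds.
// Defaults mirror the alerting configuration above.
function classifyMemory(usedMb, { warningMb = 1536, criticalMb = 1920 } = {}) {
    if (usedMb >= criticalMb) return 'critical';
    if (usedMb >= warningMb) return 'warning';
    return 'ok';
}

// Classify the current process's resident set size
const rssMb = process.memoryUsage().rss / (1024 * 1024);
console.log(`${rssMb.toFixed(0)} MB -> ${classifyMemory(rssMb)}`);
```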

Pitfall 5: Ignoring Fragmentation in Long-Running Processes

Problem: Even with bounded memory, heap fragmentation can cause apparent memory exhaustion before the limit is reached.

Solution: Schedule periodic restarts or implement memory defragmentation.

# openclaw.yaml - restart policy
router:
  process:
    restart_on_memory_threshold_mb: 1792
    restart_grace_period_seconds: 30
    min_uptime_seconds: 3600
    # Allows graceful restart before OOM kill

// Or implement manual restart trigger
async function checkMemoryHealth() {
    const memoryUsage = process.memoryUsage();
    const heapUsedMB = memoryUsage.heapUsed / 1024 / 1024;
    const heapTotalMB = memoryUsage.heapTotal / 1024 / 1024;
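    // Rough proxy for fragmentation: heap reserved relative to heap in use
    const fragmentationRatio = heapTotalMB / Math.max(heapUsedMB, 1);
    
    // Trigger graceful restart only when a large heap is mostly unused;
    // the 1024 MB floor avoids restarting small, freshly started processes
    if (fragmentationRatio > 2.0 && heapTotalMB > 1024) {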
        console.log('High fragmentation detected, initiating graceful restart');
        await initiateGracefulRestart();
    }
}

Pitfall 6: Missing Request Size Limits

Problem: Individual requests with large payloads can consume significant memory and trigger OOM.

Solution: Always enforce maximum request body size limits.

# openclaw.yaml - request size limits
router:
  limits:
    max_request_body_mb: 10
    max_request_headers_kb: 32
    max_concurrent_requests: 1000

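// Also enforce in request handler
function enforceBodyLimit(req, res, maxBytes) {
    let bytesReceived = 0;
    let rejected = false;
    
    req.on('data', (chunk) => {
        if (rejected) return;
        bytesReceived += chunk.length;
        if (bytesReceived > maxBytes) {
            rejected = true;
            if (!res.headersSent) {
                // Send the 413 before destroying the request; destroying first
                // closes the socket and the client never sees the response
                res.writeHead(413, { 'Connection': 'close' });
                res.end('Request body too large', () => req.destroy());
            } else {
                req.destroy();
            }
        }
    });
}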

6. Related Errors

The following errors are commonly related to clw-router-oom:

Error Code               Relationship          Description
clw-router-disconnect    Downstream effect     Router disconnects clients during OOM recovery
clw-worker-oom           Similar pattern       Worker process memory exhaustion (same root causes)
clw-queue-overflow       Precursor condition   Queue fills up before OOM occurs
clw-socket-econnreset    Downstream effect     Connection resets due to router instability from OOM
clw-process-crash        Downstream effect     Router process terminates due to OOM kill
clw-gc-pause-exceeded    Related symptom       Long GC pauses indicate memory pressure
clw-heap-growing         Precursor condition   Continuous heap growth precedes OOM

Diagnostic Flow

When encountering clw-router-oom, follow this diagnostic sequence:

1. Check for clw-queue-overflow: was the queue backing up before OOM?
2. Check for clw-gc-pause-exceeded: was GC struggling with memory pressure?
3. Review memory metrics: identify the peak usage pattern
4. Analyze heap dumps: find memory leak sources
5. Review queue configuration: verify bounded limits
6. Check cache configuration: ensure the cache is memory-bounded

Escalation Path

clw-router-oom
    ├── Check router logs for precursor warnings
    │   └── Look for: clw-gc-pause-exceeded, clw-queue-overflow
    ├── Verify queue backpressure is configured
    │   └── If not, apply Fix 1
    ├── Profile for memory leaks
    │   ├── If leaks found, apply Fix 2
    │   └── If no leaks, check workload vs. allocated memory
    ├── Verify cache memory bounds
    │   └── If unbounded, apply Fix 3
    ├── Consider increasing memory limit
    │   └── Apply Fix 5 if workload legitimately requires more
    └── Implement monitoring
        └── Apply Pitfall 4 monitoring configuration
