Fix clw-worker-oom: OpenClaw Worker Process Out of Memory

1. Symptoms

The clw-worker-oom error means an OpenClaw worker process exhausted its memory. The symptoms below help you confirm that quickly.

Primary Symptoms:

[ERROR] Worker process terminated unexpectedly. Exit code: 137
[clw-worker-oom] Worker worker-02 exceeded memory limit: 2147483648 bytes
[clw] Memory utilization: 100.0% of 2048MB limit
[clw] Active tasks: 12 | Completed: 847 | Failed: 3


**Process-Level Indicators:**

```bash
# Check for OOM killer invocations
dmesg | grep -iE "out of memory|oom-killer"
# Example output: [12345.678] Worker invoked oom-killer: pid 12345, uid 1000

# Check worker process memory usage (VSZ and RSS are reported in KiB)
ps aux | grep clw-worker
# Example output (columns abbreviated):
# USER       PID    %CPU   %MEM    VSZ      RSS     COMMAND
# clw        1234   45.2   95.3    3145728  2097152 clw-worker --id=worker-02
```

Application-Level Symptoms:

Tasks queue up without being processed
Worker process exits with code 137 (128 + 9 = SIGKILL)
Error logs contain [clw-worker-oom] prefix
Cluster reports worker_health: degraded
Response latency spikes for dispatched tasks

Observable Metrics:

{
  "worker_id": "worker-02",
  "error_code": "clw-worker-oom",
  "memory_limit_mb": 2048,
  "memory_peak_mb": 2100,
  "memory_utilization_pct": 102.5,
  "timestamp": "2025-01-20T14:32:15Z"
}
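
The utilization figure in a record like this can be recomputed from the peak and limit fields, which is a quick sanity check when triaging. The sketch below assumes the record has exactly the fields shown above; the helper name `utilization_pct` is illustrative and not part of any OpenClaw API.

```python
import json

# Metrics record copied from the example above.
record = json.loads("""
{
  "worker_id": "worker-02",
  "error_code": "clw-worker-oom",
  "memory_limit_mb": 2048,
  "memory_peak_mb": 2100,
  "memory_utilization_pct": 102.5,
  "timestamp": "2025-01-20T14:32:15Z"
}
""")


def utilization_pct(peak_mb: float, limit_mb: float) -> float:
    """Peak memory as a percentage of the configured limit, to one decimal."""
    return round(peak_mb / limit_mb * 100, 1)


computed = utilization_pct(record["memory_peak_mb"], record["memory_limit_mb"])
over_limit = computed > 100.0  # anything above 100% means the limit was exceeded
print(computed, over_limit)
```

A value above 100% means the peak was sampled past the limit, just before the process was killed.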

2. Root Cause

The clw-worker-oom error occurs when a worker process consumes memory beyond its configured limit. Understanding the root causes helps prevent recurrence.

Primary Causes:

Unbounded Memory Growth in Task Handlers

Workers process tasks in a loop. If task handlers accumulate state without cleanup, memory grows until the limit is hit:

# Problematic: accumulating results without memory management
class TaskHandler:
    def __init__(self):
        self.results = []  # Grows indefinitely

    def process(self, task):
        result = self.execute(task)
        self.results.append(result)  # Memory leak
        return result

Large Task Payloads

Processing extremely large payloads without streaming or chunking can exhaust memory:

# Problematic: loading entire payload into memory
def handle_task(task):
    # This loads a potentially massive payload all at once
    data = task.get_full_payload()  # Could be 5GB
    return process_large_file(data)  # plain function call: there is no self here

Memory Leaks in Native Dependencies

Workers often depend on native libraries (database clients, HTTP libraries) that may leak memory under certain conditions:

[clw] WARNING: Native memory leak detected in libclient.so
[clw] Leaked: 15MB per 1000 requests
[clw] Estimated time to OOM: 4 hours
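
An estimate like the one in this log follows from simple arithmetic: remaining headroom divided by the leak rate. The sketch below shows one way to reproduce it. Only the 15MB per 1000 requests figure comes from the log; the request rate and the free headroom are assumed values, and `hours_to_oom` is a hypothetical helper.

```python
def hours_to_oom(headroom_mb: float, leak_mb_per_1000_requests: float,
                 requests_per_second: float) -> float:
    """Estimate hours until a steady leak consumes the remaining headroom."""
    leak_mb_per_second = leak_mb_per_1000_requests / 1000 * requests_per_second
    if leak_mb_per_second <= 0:
        raise ValueError("leak rate must be positive")
    return headroom_mb / leak_mb_per_second / 3600


# 15 MB per 1000 requests comes from the log above. The 5 requests/second
# rate and the 1080 MB of free headroom are assumptions for illustration.
estimate = hours_to_oom(headroom_mb=1080, leak_mb_per_1000_requests=15,
                        requests_per_second=5)
print(f"{estimate:.1f} hours")
```

Under those assumed inputs the estimate lands at about four hours. A real worker's request rate varies, so treat the result as an order-of-magnitude guide.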

Incorrect Memory Configuration

Setting memory limits too low for the workload is a common configuration error:

# Example: undersized memory limit
worker:
  resources:
    memory: "512Mi"  # Too small for expected workload
    cpu: "1000m"

Concurrent Task Overload

Running too many concurrent tasks without memory headroom:

# Problematic: unbounded concurrency
@clw.task_handler
async def process_task(task):
    # Each task allocates 200MB
    # Running 20 concurrently = 4GB needed
    return heavy_operation(task)

# Without limiting concurrency, memory explodes
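
One common way to cap concurrency is an `asyncio.Semaphore` around the task body. The sketch below is self-contained and does not use the `clw` decorator; `heavy_operation` is a stand-in, and the limit of 4 is an assumed budget rather than a recommended value.

```python
import asyncio

MAX_CONCURRENT_TASKS = 4  # assumed budget: memory limit / per-task memory


async def heavy_operation(task_id: int) -> int:
    """Stand-in for a memory-hungry task (hypothetical)."""
    await asyncio.sleep(0.01)
    return task_id * 2


async def run_bounded(task_ids, limit=MAX_CONCURRENT_TASKS):
    semaphore = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def guarded(task_id):
        nonlocal running, peak
        async with semaphore:  # at most `limit` tasks hold memory at once
            running += 1
            peak = max(peak, running)
            try:
                return await heavy_operation(task_id)
            finally:
                running -= 1

    results = await asyncio.gather(*(guarded(t) for t in task_ids))
    return results, peak


results, peak = asyncio.run(run_bounded(range(20)))
print(f"processed {len(results)} tasks, peak concurrency {peak}")
```

All 20 tasks still complete, but no more than four are in flight at any moment, so peak memory is bounded by four times the per-task footprint.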

Garbage Collection Failures

In garbage-collected runtimes such as Go or Java, a collector that stalls or cannot keep pace with allocation lets the heap grow until the worker reaches its memory limit and is killed.

3. Step-by-Step Fix

Here’s a systematic approach to fix clw-worker-oom errors:

Step 1: Identify Current Memory Configuration

Before:

# Check current worker memory limits
clw config show --format=yaml

Output:

worker:
  memory_limit: "2GiB"
  task_concurrency: 50
  max_payload_size: "100MB"

After (memory limit raised, concurrency and payload size lowered):

worker:
  memory_limit: "4GiB"
  task_concurrency: 20
  max_payload_size: "50MB"
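
When choosing a `task_concurrency` value, it can help to derive it from the memory limit instead of guessing. The sketch below is one way to do that; the per-task footprint and the 25% headroom reserved for the runtime are assumptions you would replace with measured values, and `max_concurrency` is a hypothetical helper.

```python
def max_concurrency(memory_limit_mb: int, per_task_mb: int,
                    headroom_fraction: float = 0.25) -> int:
    """Largest task count whose combined memory fits under the limit,
    after reserving a fraction of the limit for the runtime itself."""
    if per_task_mb <= 0:
        raise ValueError("per_task_mb must be positive")
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    usable_mb = memory_limit_mb * (1 - headroom_fraction)
    return max(1, int(usable_mb // per_task_mb))


# Example: a 4 GiB limit with an assumed 200 MB per task.
print(max_concurrency(memory_limit_mb=4096, per_task_mb=200))
```

The result is only as good as the per-task estimate, so measure a representative task under load before relying on it.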

Step 2: Review Task Handler Code for Memory Leaks

Before:

import asyncio
from typing import List

class HeavyTaskHandler:
    def __init__(self):
        self.cache: List[dict] = []

    async def process(self, task):
        result = await self.compute(task)
        # Memory leak: cache grows forever
        self.cache.append(result)
        # Also storing full payload
        task_data = await task.full_data()  # Keeps in memory
        return result

After:

from collections import OrderedDict

class HeavyTaskHandler:
    def __init__(self, max_cache_size: int = 100):
        self.cache: OrderedDict = OrderedDict()
        self.max_cache_size = max_cache_size

    async def process(self, task):
        result = await self.compute(task)
        # Bounded cache: once the cap is exceeded, evict the oldest inserted
        # entry. This is FIFO eviction; true LRU would also call
        # move_to_end() whenever an entry is read.
        self.cache[task.id] = result
        if len(self.cache) > self.max_cache_size:
            self.cache.popitem(last=False)
        return result

    def clear_cache(self):
        self.cache.clear()
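
If it is unclear which handler is leaking, Python's standard `tracemalloc` module can compare two heap snapshots and report the source lines whose allocations grew. The sketch below simulates a leak with a plain list; in a real worker you would take the snapshots before and after a batch of tasks. The helper `top_growth` is illustrative.

```python
import tracemalloc


def top_growth(before, after, limit=3):
    """Return the source lines whose allocations grew most between snapshots."""
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()

leaky_results = []
for i in range(10_000):
    leaky_results.append("result-%d" % i)  # simulated handler leak

after = tracemalloc.take_snapshot()
tracemalloc.stop()

for stat in top_growth(before, after):
    print(stat)  # file, line number, and size delta of each growing site
```

Tracing adds overhead, so this is better suited to a staging worker or a short diagnostic window than to permanent production use.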

Step 3: Implement Streaming for Large Payloads

Before:

async def handle_large_file(task):
    # Loads entire file into memory
    data = await task.download_payload()
    result = process_in_memory(data)
    return result

After:

async def handle_large_file(task):
    # Stream-based processing
    async with task.payload_stream() as stream:
        result = await process_streamed(stream)
    return result

CHUNK_THRESHOLD = 64 * 1024 * 1024  # bytes per batch (example value)

async def process_streamed(stream):
    accumulated_size = 0
    chunk_results = []
    async for chunk in stream:
        chunk_results.append(process_chunk(chunk))
        accumulated_size += len(chunk)
        # Collapse partial results in batches so the list stays small;
        # aggregate_results is expected to return a shorter list
        if accumulated_size > CHUNK_THRESHOLD:
            chunk_results = aggregate_results(chunk_results)
            accumulated_size = 0  # start counting the next batch
    return final_result(chunk_results)
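
To see the streaming pattern run end to end, the following self-contained sketch replaces the payload stream with an async generator and folds every chunk into a running SHA-256 digest, so no chunk is retained after it is processed. `fake_payload_stream` is a stand-in for `task.payload_stream()`, and the 64 KiB chunk size is an assumption.

```python
import asyncio
import hashlib

CHUNK_SIZE = 64 * 1024  # assumed chunk size in bytes


async def fake_payload_stream(total_bytes: int, chunk_size: int = CHUNK_SIZE):
    """Stand-in for task.payload_stream(): yields the payload in chunks."""
    sent = 0
    while sent < total_bytes:
        size = min(chunk_size, total_bytes - sent)
        yield b"x" * size
        sent += size
        await asyncio.sleep(0)  # let other tasks run between chunks


async def digest_stream(stream):
    """Fold each chunk into a running digest; no chunk is kept afterwards."""
    hasher = hashlib.sha256()
    total = 0
    async for chunk in stream:
        hasher.update(chunk)
        total += len(chunk)
    return total, hasher.hexdigest()


total, digest = asyncio.run(digest_stream(fake_payload_stream(1_000_000)))
print(total, digest[:16])
```

Memory use stays near one chunk regardless of payload size, because the only state carried between iterations is the hasher and a byte counter.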

Step 4: Configure Resource Limits Properly

Before:

# clw-worker.yaml - undersized configuration
apiVersion: clw.io/v1
kind: WorkerConfig
spec:
  resources:
    limits:
      memory: "1Gi"
    requests:
      memory: "1Gi"
  concurrency: 100
  taskTimeout: "30s"

After:

# clw-worker.yaml - properly sized configuration
apiVersion: clw.io/v1
kind: WorkerConfig
spec:
  resources:
    limits:
      memory: "4Gi"
      ephemeral-storage: "2Gi"
    requests:
      memory: "2Gi"
  concurrency: 25
  taskTimeout: "60s"
  memoryReservation: "2Gi"
  oomScoreAdj: 100

Step 5: Add Memory Monitoring and Auto-Restart

Before:

# Minimal worker setup
def main():
    worker = create_worker()
    worker.run()

After:

import logging
import resource
import signal
import sys

CHECK_INTERVAL_SECONDS = 30

class MonitoredWorker:
    def __init__(self, memory_threshold_mb: float = 3276):
        self.logger = logging.getLogger(__name__)
        # Example default: roughly 80% of a 4GiB limit
        self.memory_threshold_mb = memory_threshold_mb
        self.restart_count = 0
        self.max_restarts = 5

    def setup_memory_monitoring(self):
        # SIGALRM fires once per alarm() call, so check_memory re-arms it
        signal.signal(signal.SIGALRM, self.check_memory)
        signal.alarm(CHECK_INTERVAL_SECONDS)

    def check_memory(self, signum, frame):
        # ru_maxrss is the peak RSS so far, in KiB on Linux (bytes on macOS)
        usage = resource.getrusage(resource.RUSAGE_SELF)
        memory_mb = usage.ru_maxrss / 1024

        self.logger.info(f"Memory usage: {memory_mb:.2f} MB")

        if memory_mb > self.memory_threshold_mb:
            self.logger.warning(
                f"Memory threshold exceeded: {memory_mb}MB > {self.memory_threshold_mb}MB"
            )
            self.trigger_graceful_shutdown()
        signal.alarm(CHECK_INTERVAL_SECONDS)  # schedule the next check

    def trigger_graceful_shutdown(self):
        self.logger.info("Initiating graceful shutdown due to memory pressure")
        # task_queue and run() are expected to come from the worker runtime
        self.task_queue.drain()
        sys.exit(0)

def main():
    worker = MonitoredWorker()
    worker.setup_memory_monitoring()
    worker.run()
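
Note that `ru_maxrss` reports the peak resident set size, so it never decreases even after memory is freed. If you want the current value instead, one option on Linux is to read `VmRSS` from `/proc/self/status`. The sketch below assumes a Linux host; both helper names are illustrative.

```python
from typing import Optional


def parse_vm_rss_mb(status_text: str) -> Optional[float]:
    """Extract VmRSS from /proc/<pid>/status text and convert kB to MB."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            fields = line.split()  # e.g. ["VmRSS:", "2097152", "kB"]
            return int(fields[1]) / 1024
    return None


def current_rss_mb() -> Optional[float]:
    """Current resident set size of this process (Linux only)."""
    try:
        with open("/proc/self/status") as status_file:
            return parse_vm_rss_mb(status_file.read())
    except OSError:
        return None  # not Linux, or /proc is unavailable


print(current_rss_mb())
```

Separating the parser from the file read keeps the parsing logic testable on any platform.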

Step 6: Configure Kubernetes OOM Handling

# worker-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: clw-worker
  labels:
    app: clw-worker
spec:
  containers:
    - name: worker
      image: openclaw/worker:2.1.0
      resources:
        limits:
          memory: "4Gi"
        requests:
          memory: "2Gi"
      env:
        - name: CLW_MEMORY_LIMIT
          valueFrom:
            resourceFieldRef:
              resource: limits.memory
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "clw-worker drain --timeout=30s"]
  restartPolicy: Always
  terminationGracePeriodSeconds: 60
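
With the manifest above, Kubernetes exposes the container's memory limit to the process through `CLW_MEMORY_LIMIT` as a byte count (the Downward API default when no divisor is set). A worker could derive its own shutdown threshold from that value, as in this sketch. The 80% fraction and the helper name are assumptions; whether OpenClaw itself reads this variable is not stated in this guide.

```python
import os
from typing import Mapping, Optional


def shutdown_threshold_bytes(env: Mapping[str, str] = os.environ,
                             fraction: float = 0.8) -> Optional[int]:
    """Return the byte count at which the worker should start draining,
    or None when CLW_MEMORY_LIMIT is not set."""
    raw = env.get("CLW_MEMORY_LIMIT")
    if raw is None:
        return None
    return int(int(raw) * fraction)


# Example with the 4Gi limit from the manifest (4 * 1024**3 bytes).
print(shutdown_threshold_bytes({"CLW_MEMORY_LIMIT": str(4 * 1024**3)}))
```

Deriving the threshold from the injected limit keeps the application and the pod spec in sync when the limit changes.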

4. Verification

After implementing fixes, verify that the clw-worker-oom error is resolved:

Immediate Verification:

# Check worker health status
clw worker status --watch

# Output should show:
# WORKER ID    STATUS      MEMORY USAGE    TASKS
# worker-01    healthy     1.8GiB / 4GiB   12/1000
# worker-02    healthy     2.1GiB / 4GiB   8/500

# Check for OOM events in logs
clw logs --since=1h --filter=oom

# No recent OOM events found

Load Testing Verification:

# Run memory stress test
clw benchmark --workload=memory-intensive --duration=10m --concurrency=50

# Expected output:
# Benchmark completed
# Total tasks: 5000
# Failed: 0 (0.00%)
# Peak memory: 3.2GiB / 4GiB (80%)
# No OOM events detected

Prometheus Metrics Verification:

# Query memory metrics
clw_metrics_memory_usage_max{worker_id="worker-02"}
# Result: 2100000000 (2.1GB, well under 4GB limit)

clw_worker_restart_total{reason="oom"}
# Result: 0 (no OOM restarts in last 24h)

Application-Level Verification:

# Check worker memory metrics
import clw

stats = clw.get_worker_stats()
print(f"Memory Usage: {stats.memory_used_mb}MB / {stats.memory_limit_mb}MB")
print(f"Utilization: {stats.memory_pct}%")
print(f"OOM Count (24h): {stats.oom_count}")

# Expected output:
# Memory Usage: 2100MB / 4096MB
# Utilization: 51.27%
# OOM Count (24h): 0

5. Common Pitfalls

Avoid these common mistakes when fixing clw-worker-oom errors:

Pitfall 1: Increasing Memory Without Fixing Root Cause

Simply raising the memory limit masks the problem rather than fixing it:

# ❌ Wrong: Just throw more memory at it
resources:
  limits:
    memory: "16Gi"  # This avoids fixing the actual leak

# ✅ Correct: Fix the leak AND right-size the limit
resources:
  limits:
    memory: "4Gi"  # Properly sized for fixed workload

Pitfall 2: Forgetting to Limit Concurrency

Raising memory but leaving concurrency unlimited still causes OOM:

# ❌ Wrong: Unbounded concurrency
spec:
  memory: "8Gi"
  concurrency: 500  # Will still OOM

# ✅ Correct: Both limits properly configured
spec:
  memory: "4Gi"
  concurrency: 25  # Bounded for memory budget

Pitfall 3: Not Setting Up Graceful Shutdown

Abrupt termination leaves tasks in an inconsistent state:

# ❌ Wrong: No graceful shutdown handling
lifecycle:
  preStop:
    exec:
      command: ["kill", "-15", "1"]  # Sends SIGTERM at once, with no drain step

# ✅ Correct: Graceful drain period
lifecycle:
  preStop:
    exec:
      command: ["/clw/bin/drain", "--timeout=60s"]
terminationGracePeriodSeconds: 90
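
On the application side, a drain usually means: stop accepting new work on SIGTERM, but let the task in progress finish. The sketch below shows that pattern with the standard `signal` and `threading` modules. `GracefulStopper` and `run_until_stopped` are hypothetical names, and the list-based queue is a stand-in for a real task queue.

```python
import signal
import threading


class GracefulStopper:
    """Records a stop request so the work loop can exit between tasks."""

    def __init__(self):
        self._stop = threading.Event()

    def install(self):
        """Register for SIGTERM; must be called from the main thread."""
        signal.signal(signal.SIGTERM, self.request_stop)

    def request_stop(self, signum=None, frame=None):
        self._stop.set()

    @property
    def stopping(self) -> bool:
        return self._stop.is_set()


def run_until_stopped(queue, stopper, handle):
    """Process tasks until the queue is empty or a stop is requested.
    The task in progress always finishes; unprocessed tasks are returned."""
    while queue and not stopper.stopping:
        handle(queue.pop(0))
    return queue


stopper = GracefulStopper()
processed = []
leftover = run_until_stopped([1, 2, 3], stopper, processed.append)
print(processed, leftover)
```

In a real worker you would call `stopper.install()` at startup and hand the returned leftover tasks back to the dispatcher before exiting.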

Pitfall 4: Missing Memory Monitoring

Not monitoring memory until OOM occurs:

# ❌ Wrong: No monitoring
def run_worker():
    worker = Worker()
    worker.start()

# ✅ Correct: Proactive monitoring
def run_worker():
    worker = Worker()
    worker.add_health_check(
        check_fn=check_memory_usage,
        interval="10s",
        threshold_pct=80
    )
    worker.start()

Pitfall 5: Ignoring Native Memory

Java and Go workers use native (off-heap) memory in addition to the managed heap, so heap-only metrics understate total usage:

# ❌ Wrong: Only monitoring heap
jstat -gc $(pgrep clw-worker) | head -5

# ✅ Correct: Monitoring all memory
grep -i vmrss /proc/$(pgrep -o clw-worker)/status
# VmRSS:   4194304 kB (resident memory, including native allocations)

Pitfall 6: Not Handling Memory Pressure in Middleware

# ❌ Wrong: No memory awareness
@app.middleware
async def process_task(request, call_next):
    return await call_next(request)

# ✅ Correct: Memory-aware middleware
@app.middleware
async def process_task(request, call_next):
    if is_memory_pressure():
        return Response(
            status=503,
            body="Server overloaded, retry later"
        )
    return await call_next(request)
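
The `is_memory_pressure()` check is left undefined above. One possible implementation, assuming the worker runs in a container on a cgroup v2 host, compares `memory.current` against `memory.max`. The mount path, the 85% threshold, and the choice to report no pressure when the files are missing are all assumptions in this sketch.

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # usual cgroup v2 mount point (assumed)


def is_memory_pressure(threshold: float = 0.85,
                       root: Path = CGROUP_ROOT) -> bool:
    """True when cgroup memory usage is at or above threshold * limit."""
    try:
        current = int((root / "memory.current").read_text())
        limit_text = (root / "memory.max").read_text().strip()
    except (OSError, ValueError):
        return False  # no cgroup v2 data available: report no pressure
    if limit_text == "max":
        return False  # no memory limit is configured for this cgroup
    return current / int(limit_text) >= threshold


print(is_memory_pressure())
```

Reading two small files per request is cheap, but under high request rates you may prefer to sample periodically and cache the result.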

6. Related Errors

The following errors are commonly related to clw-worker-oom:

clw-container-limit

Triggered when a container approaches one of its resource limits (CPU, memory, or storage). When memory is the constrained resource, this warning often appears shortly before clw-worker-oom.

[clw-container-limit] Container approaching memory limit
Memory: 1.9GiB / 2GiB (95%)

clw-task-timeout

Tasks that consume excessive memory often run slowly and hit timeout limits. Memory leaks can cause both OOM and timeout errors:

[clw-task-timeout] Task task-1234 exceeded 120s timeout
[clw] Processing time increased 10x due to memory pressure

clw-dispatch-fail

When workers are OOM and restart continuously, the dispatcher cannot route tasks:

[clw-dispatch-fail] No healthy workers available
[clw] All workers in degraded state (restarting from OOM)

clw-gc-pause

Excessive garbage collection pauses (especially in JVM-based workers) indicate memory pressure and often precede OOM:

[clw-gc-pause] GC pause: 2.5s (threshold: 500ms)
[clw] Memory usage: 92% before GC
[clw] Recommendation: Increase memory or fix leaks

clw-swap-thrash

When memory is exhausted and the system starts swapping, performance degrades rapidly:

[clw] Swap usage: 2.3 GiB
[clw] Swap-in rate: 500MB/s
[clw] System under memory pressure

clw-connection-drop

Database or API connections may drop when workers are under memory pressure and cannot respond to keepalives:

[clw-connection-drop] PostgreSQL connection closed
[clw] Worker at 98% memory when connection terminated

Quick Reference:

| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Memory Utilization | < 70% | 70-85% | > 85% |
| OOM Restarts (24h) | 0 | 1-2 | > 2 |
| GC Pause | < 200ms | 200-500ms | > 500ms |
| Swap Usage | 0 | < 1GB | > 1GB |

Monitor these metrics to prevent clw-worker-oom from recurring.
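
The memory utilization bands from the table translate directly into an alerting rule. The sketch below encodes only that one row; the function name is illustrative, and the boundary handling (70% and 85% both count as warning) follows the table's ranges.

```python
def classify_memory_utilization(pct: float) -> str:
    """Map a memory utilization percentage to the table's status bands."""
    if pct < 70:
        return "healthy"
    if pct <= 85:
        return "warning"
    return "critical"


for sample in (51.27, 80.0, 102.5):
    print(sample, classify_memory_utilization(sample))
```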
