Fix clw-worker-oom: OpenClaw Worker Process Out of Memory


1. Symptoms

The clw-worker-oom error means an OpenClaw worker process has exhausted its memory. The symptoms below help you confirm that quickly.

Primary Symptoms:

[ERROR] Worker process terminated unexpectedly. Exit code: 137
[clw-worker-oom] Worker worker-02 exceeded memory limit: 2147483648 bytes
[clw] Memory utilization: 100.0% of 2048MB limit
[clw] Active tasks: 12 | Completed: 847 | Failed: 3


**Process-Level Indicators:**

```bash
# Check for OOM killer invocations
dmesg | grep -iE "out of memory|oom-killer"
# Output: [12345.678] Worker invoked oom-killer: pid 12345, uid 1000

# Check worker process memory usage
ps aux | grep clw-worker
# Output:
# USER       PID    %CPU   %MEM    VSZ      RSS     COMMAND
# clw        1234   45.2   95.3    3145728  2097152 clw-worker --id=worker-02
```

Application-Level Symptoms:

  • Tasks queue up without being processed
  • Worker process exits with code 137 (128 + 9 = SIGKILL)
  • Error logs contain [clw-worker-oom] prefix
  • Cluster reports worker_health: degraded
  • Response latency spikes for dispatched tasks
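The exit-code arithmetic in the list above (128 plus the signal number) can be checked with a few lines of Python. This is a minimal sketch using only the standard `signal` module; the helper name `describe_exit_code` is hypothetical and not part of OpenClaw.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit status into a readable description."""
    if code > 128:
        # Shells report "killed by signal N" as 128 + N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Unknown signal number on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

On Linux and macOS this reports SIGKILL for 137, which is the signal the kernel OOM killer and container runtimes send when a memory limit is exceeded.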

Observable Metrics:

{
  "worker_id": "worker-02",
  "error_code": "clw-worker-oom",
  "memory_limit_mb": 2048,
  "memory_peak_mb": 2100,
  "memory_utilization_pct": 102.5,
  "timestamp": "2025-01-20T14:32:15Z"
}

2. Root Cause

The clw-worker-oom error occurs when a worker process consumes memory beyond its configured limit. Understanding the root causes helps prevent recurrence.

Primary Causes:

  1. Unbounded Memory Growth in Task Handlers

Workers process tasks in a loop. If task handlers accumulate state without cleanup, memory grows until the limit is hit:

# Problematic: accumulating results without memory management
class TaskHandler:
    def __init__(self):
        self.results = []  # Grows indefinitely

    def process(self, task):
        result = self.execute(task)
        self.results.append(result)  # Memory leak
        return result
  2. Large Task Payloads

Processing extremely large payloads without streaming or chunking can exhaust memory:

# Problematic: loading entire payload into memory
def handle_task(task):
    # This loads a potentially massive payload all at once
    data = task.get_full_payload()  # Could be 5GB
    return process_large_file(data)
  3. Memory Leaks in Native Dependencies

Workers often depend on native libraries (database clients, HTTP libraries) that may leak memory under certain conditions:

[clw] WARNING: Native memory leak detected in libclient.so
[clw] Leaked: 15MB per 1000 requests
[clw] Estimated time to OOM: 4 hours
  4. Incorrect Memory Configuration

Setting memory limits too low for the workload is a common configuration error:

# Example: undersized memory limit
worker:
  resources:
    memory: "512Mi"  # Too small for expected workload
    cpu: "1000m"
  5. Concurrent Task Overload

Running too many concurrent tasks without memory headroom:

# Problematic: unbounded concurrency
@clw.task_handler
async def process_task(task):
    # Each task allocates 200MB
    # Running 20 concurrently = 4GB needed
    return heavy_operation(task)

# Without limiting concurrency, memory explodes
  6. Garbage Collection Failures

In garbage-collected runtimes such as Go or Java, a collector that stalls or cannot keep up with the allocation rate lets the heap keep growing until the worker hits its memory limit.
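To confirm the first cause in the list above (a handler that accumulates state), one option in a Python worker is to compare two `tracemalloc` snapshots taken around a batch of tasks. The sketch below is illustrative only: `LeakyHandler` is a hypothetical stand-in for the problematic handler, and the 10 KB result size is made up for the demonstration.

```python
import tracemalloc


class LeakyHandler:
    """Hypothetical stand-in for a handler that keeps every result."""

    def __init__(self):
        self.results = []

    def process(self, task_id: int) -> bytes:
        result = bytes(10_000)  # pretend each result is about 10 KB
        self.results.append(result)  # never released
        return result


def measure_growth(handler, n_tasks: int) -> int:
    """Return the net number of bytes allocated while processing n_tasks."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for task_id in range(n_tasks):
        handler.process(task_id)
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Sum the per-line allocation differences between the two snapshots.
    return sum(stat.size_diff for stat in after.compare_to(before, "lineno"))


growth = measure_growth(LeakyHandler(), 1_000)
print(f"Net growth after 1000 tasks: {growth / 1024:.0f} KiB")
```

Growth that scales with the number of tasks processed, rather than levelling off, points at retained state. Printing the largest entries of `compare_to` individually shows which source line is responsible.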

3. Step-by-Step Fix

Here’s a systematic approach to fix clw-worker-oom errors:

Step 1: Identify Current Memory Configuration

Before:

# Check current worker memory limits
clw config show --format=yaml

Output:

worker:
  memory_limit: "2GiB"
  task_concurrency: 50
  max_payload_size: "100MB"

After:

worker:
  memory_limit: "4GiB"
  task_concurrency: 20
  max_payload_size: "50MB"
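One way to arrive at a concurrency value like the one above is to divide the memory limit by the per-task footprint. The sketch below assumes the 200 MB per task figure from the root-cause example; the function name and the `headroom_fraction` parameter are hypothetical, and real per-task usage should be measured rather than guessed.

```python
def max_safe_concurrency(limit_mb: float, per_task_mb: float,
                         headroom_fraction: float = 0.0) -> int:
    """Largest task count whose combined footprint fits under the limit."""
    if per_task_mb <= 0:
        raise ValueError("per_task_mb must be positive")
    if not 0.0 <= headroom_fraction < 1.0:
        raise ValueError("headroom_fraction must be in [0, 1)")
    # Reserve part of the limit for the runtime, buffers, and spikes.
    usable_mb = limit_mb * (1.0 - headroom_fraction)
    return int(usable_mb // per_task_mb)


# 4 GiB limit, 200 MB per task, no headroom
print(max_safe_concurrency(4096, 200))
# Same limit, keeping 20% of it free
print(max_safe_concurrency(4096, 200, headroom_fraction=0.2))
```

With no headroom the result is 20, which matches the setting shown above. Reserving 20% lowers it to 16, which may be the safer choice if the worker itself has a sizeable baseline footprint.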

Step 2: Review Task Handler Code for Memory Leaks

Before:

import asyncio
from typing import List

class HeavyTaskHandler:
    def __init__(self):
        self.cache: List[dict] = []

    async def process(self, task):
        result = await self.compute(task)
        # Memory leak: cache grows forever
        self.cache.append(result)
        # Also storing full payload
        task_data = await task.full_data()  # Keeps in memory
        return result

After:

import asyncio
from collections import OrderedDict
from typing import Dict

class HeavyTaskHandler:
    def __init__(self, max_cache_size: int = 100):
        self.cache: OrderedDict = OrderedDict()
        self.max_cache_size = max_cache_size

    async def process(self, task):
        result = await self.compute(task)
        # Bounded cache: evicts the oldest-inserted entry (FIFO) once the cap is exceeded
        self.cache[task.id] = result
        if len(self.cache) > self.max_cache_size:
            self.cache.popitem(last=False)
        return result

    def clear_cache(self):
        self.cache.clear()
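The handler above evicts entries in insertion order, because nothing refreshes an entry when it is read. If lookups should also count as use, a small least-recently-used cache can be built on the same `OrderedDict`. This is a sketch; the `LRUCache` class is hypothetical and not an OpenClaw API.

```python
from collections import OrderedDict
from typing import Any, Hashable


class LRUCache:
    """Fixed-size cache that drops the least recently used entry."""

    def __init__(self, max_size: int = 100):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.max_size = max_size

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)  # a rewrite also counts as use
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # drop the least recently used

    def __len__(self) -> int:
        return len(self._data)
```

Either policy keeps memory bounded. LRU simply keeps hot entries alive longer, which matters only if the handler actually reads from the cache.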

Step 3: Implement Streaming for Large Payloads

Before:

async def handle_large_file(task):
    # Loads entire file into memory
    data = await task.download_payload()
    result = process_in_memory(data)
    return result

After:

async def handle_large_file(task):
    # Stream-based processing
    async with task.payload_stream() as stream:
        result = await process_streamed(stream)
    return result

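CHUNK_THRESHOLD = 64 * 1024 * 1024  # Bytes between aggregations (illustrative value)

async def process_streamed(stream):
    accumulated_size = 0
    chunk_results = []
    async for chunk in stream:
        result = process_chunk(chunk)
        chunk_results.append(result)
        accumulated_size += len(chunk)
        # Collapse partial results periodically so the list stays small;
        # aggregate_results is expected to return a shorter list
        if accumulated_size > CHUNK_THRESHOLD:
            chunk_results = aggregate_results(chunk_results)
            accumulated_size = 0  # Start counting towards the next batch
    return final_result(chunk_results)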
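For a self-contained illustration of the same idea, the sketch below streams a local file in fixed-size chunks and folds each chunk into a running SHA-256 digest, so memory use stays near one chunk regardless of file size. The names `iter_chunks` and `digest_stream` are hypothetical, the file reads are blocking for brevity, and `task.payload_stream()` from the example above is not used because its real interface is not shown here.

```python
import asyncio
import hashlib
from typing import AsyncIterator, Tuple


async def iter_chunks(path: str, chunk_size: int = 64 * 1024) -> AsyncIterator[bytes]:
    """Yield a file in fixed-size chunks (blocking reads, kept simple)."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk
            await asyncio.sleep(0)  # let other tasks run between chunks


async def digest_stream(chunks: AsyncIterator[bytes]) -> Tuple[str, int]:
    """Fold each chunk into running state; no chunk is kept afterwards."""
    hasher = hashlib.sha256()
    total_bytes = 0
    async for chunk in chunks:
        hasher.update(chunk)
        total_bytes += len(chunk)
    return hasher.hexdigest(), total_bytes
```

Calling `asyncio.run(digest_stream(iter_chunks("payload.bin")))` returns the digest and byte count for a file named `payload.bin` (a placeholder path). The key property is that only the hasher and a counter survive each loop iteration.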

Step 4: Configure Resource Limits Properly

Before:

# clw-worker.yaml - undersized configuration
apiVersion: clw.io/v1
kind: WorkerConfig
spec:
  resources:
    limits:
      memory: "1Gi"
    requests:
      memory: "1Gi"
  concurrency: 100
  taskTimeout: "30s"

After:

# clw-worker.yaml - properly sized configuration
apiVersion: clw.io/v1
kind: WorkerConfig
spec:
  resources:
    limits:
      memory: "4Gi"
      ephemeral-storage: "2Gi"
    requests:
      memory: "2Gi"
  concurrency: 25
  taskTimeout: "60s"
  memoryReservation: "2Gi"
  oomScoreAdj: 100
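When comparing values such as "4Gi" against byte counts in logs (for example the 2147483648 bytes reported in the symptoms), it helps to convert quantities to bytes. The sketch below handles only a subset of the Kubernetes quantity suffixes; the function names are hypothetical, and whether OpenClaw's `WorkerConfig` accepts exactly the same syntax is an assumption.

```python
# Binary suffixes are powers of 1024, decimal suffixes are powers of 1000.
_BINARY = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
_DECIMAL = {"k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as "4Gi" or "512Mi" to bytes."""
    text = quantity.strip()
    # Binary suffixes come first, though "Gi" and "G" cannot both match.
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # no suffix: already a byte count


def check_resources(limit: str, request: str) -> None:
    """Raise if the memory request is larger than the memory limit."""
    if parse_memory(request) > parse_memory(limit):
        raise ValueError(f"request {request} exceeds limit {limit}")


print(parse_memory("4Gi"))
check_resources(limit="4Gi", request="2Gi")
```

A check like this can run in CI against the configuration file so that an undersized or inverted limit is caught before deployment.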

Step 5: Add Memory Monitoring and Auto-Restart

Before:

# Minimal worker setup
def main():
    worker = create_worker()
    worker.run()

After:

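import logging
import resource
import signal
import sys

CHECK_INTERVAL_S = 30

class MonitoredWorker:
    def __init__(self, memory_threshold_mb: float = 3072):
        self.logger = logging.getLogger(__name__)
        # Illustrative default: 75% of a 4 GiB limit
        self.memory_threshold_mb = memory_threshold_mb
        self.restart_count = 0
        self.max_restarts = 5

    def setup_memory_monitoring(self):
        # SIGALRM fires once per alarm() call, so check_memory re-arms it
        signal.signal(signal.SIGALRM, self.check_memory)
        signal.alarm(CHECK_INTERVAL_S)

    def check_memory(self, signum, frame):
        usage = resource.getrusage(resource.RUSAGE_SELF)
        # ru_maxrss is the peak RSS: kilobytes on Linux, bytes on macOS
        divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
        memory_mb = usage.ru_maxrss / divisor

        self.logger.info(f"Peak memory usage: {memory_mb:.2f} MB")

        if memory_mb > self.memory_threshold_mb:
            self.logger.warning(
                f"Memory threshold exceeded: {memory_mb:.0f}MB > {self.memory_threshold_mb}MB"
            )
            self.trigger_graceful_shutdown()
        else:
            signal.alarm(CHECK_INTERVAL_S)  # Schedule the next check

    def trigger_graceful_shutdown(self):
        self.logger.info("Initiating graceful shutdown due to memory pressure")
        # task_queue and run() come from the worker implementation (not shown)
        self.task_queue.drain()
        sys.exit(0)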

def main():
    worker = MonitoredWorker()
    worker.setup_memory_monitoring()
    worker.run()
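Signal-based timers can only be installed from the main thread and are Unix-only. An alternative is to poll from a background thread. The sketch below is one possible shape: `MemoryWatchdog`, `read_mb`, and `on_exceed` are hypothetical names, and the memory reader is injected so the example does not depend on any particular platform API.

```python
import threading
from typing import Callable


class MemoryWatchdog:
    """Poll a memory reader on a background thread (illustrative sketch)."""

    def __init__(self, read_mb: Callable[[], float], threshold_mb: float,
                 on_exceed: Callable[[float], None], interval_s: float = 30.0):
        self._read_mb = read_mb
        self._threshold_mb = threshold_mb
        self._on_exceed = on_exceed
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Event.wait doubles as an interruptible sleep: it returns True
        # as soon as stop() is called, which ends the loop.
        while not self._stop.wait(self._interval_s):
            usage_mb = self._read_mb()
            if usage_mb > self._threshold_mb:
                self._on_exceed(usage_mb)
                return  # fire once, then let the worker shut down
```

The callback would typically set a flag that the main loop checks before picking up the next task, so in-flight work can finish before the process exits.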

Step 6: Configure Kubernetes OOM Handling

# worker-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: clw-worker
  labels:
    app: clw-worker
spec:
  containers:
    - name: worker
      image: openclaw/worker:2.1.0
      resources:
        limits:
          memory: "4Gi"
        requests:
          memory: "2Gi"
      env:
        - name: CLW_MEMORY_LIMIT
          valueFrom:
            resourceFieldRef:
              resource: limits.memory
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "clw-worker drain --timeout=30s"]
  restartPolicy: Always
  terminationGracePeriodSeconds: 60
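The pod spec above passes the container's memory limit to the worker as CLW_MEMORY_LIMIT. How OpenClaw itself consumes that variable is not shown here. The sketch below assumes the value arrives as a plain byte count (which is what a `resourceFieldRef` with the default divisor is expected to produce) and derives a soft shutdown threshold from it. The function name and the 80% default are hypothetical.

```python
import os
from typing import Mapping, Optional


def shutdown_threshold_bytes(env: Mapping[str, str] = os.environ,
                             percent: int = 80) -> Optional[int]:
    """Return a soft memory threshold derived from CLW_MEMORY_LIMIT.

    Returns None when the variable is not set, so callers can fall
    back to a static configuration value.
    """
    raw = env.get("CLW_MEMORY_LIMIT")
    if raw is None:
        return None
    limit_bytes = int(raw)  # assumed to be a byte count
    # Integer arithmetic avoids floating-point rounding on large values.
    return limit_bytes * percent // 100


# A 4 GiB limit expressed in bytes
print(shutdown_threshold_bytes({"CLW_MEMORY_LIMIT": "4294967296"}))
```

Deriving the threshold from the injected limit keeps the in-process check aligned with the container limit when the pod spec changes.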

4. Verification

After implementing fixes, verify that the clw-worker-oom error is resolved:

Immediate Verification:

# Check worker health status
clw worker status --watch

# Output should show:
# WORKER ID    STATUS      MEMORY USAGE    TASKS
# worker-01    healthy     1.8GiB / 4GiB   12/1000
# worker-02    healthy     2.1GiB / 4GiB   8/500

# Check for OOM events in logs
clw logs --since=1h --filter=oom

# No recent OOM events found

Load Testing Verification:

# Run memory stress test
clw benchmark --workload=memory-intensive --duration=10m --concurrency=50

# Expected output:
# Benchmark completed
# Total tasks: 5000
# Failed: 0 (0.00%)
# Peak memory: 3.2GiB / 4GiB (80%)
# No OOM events detected

Prometheus Metrics Verification:

# Query memory metrics
clw_metrics_memory_usage_max{worker_id="worker-02"}
# Result: 2100000000 (2.1GB, well under 4GB limit)

clw_worker_restart_total{reason="oom"}
# Result: 0 (no OOM restarts in last 24h)

Application-Level Verification:

# Check worker memory metrics
import clw

stats = clw.get_worker_stats()
print(f"Memory Usage: {stats.memory_used_mb}MB / {stats.memory_limit_mb}MB")
print(f"Utilization: {stats.memory_pct}%")
print(f"OOM Count (24h): {stats.oom_count}")

# Expected output:
# Memory Usage: 2100MB / 4096MB
# Utilization: 51.27%
# OOM Count (24h): 0

5. Common Pitfalls

Avoid these common mistakes when fixing clw-worker-oom errors:

Pitfall 1: Increasing Memory Without Fixing Root Cause

Simply raising the memory limit masks the problem rather than fixing it:

# ❌ Wrong: Just throw more memory at it
resources:
  limits:
    memory: "16Gi"  # This avoids fixing the actual leak

# ✅ Correct: Fix the leak AND right-size the limit
resources:
  limits:
    memory: "4Gi"  # Properly sized for fixed workload

Pitfall 2: Forgetting to Limit Concurrency

Raising memory but leaving concurrency unlimited still causes OOM:

# ❌ Wrong: Unbounded concurrency
spec:
  memory: "8Gi"
  concurrency: 500  # Will still OOM

# ✅ Correct: Both limits properly configured
spec:
  memory: "4Gi"
  concurrency: 25  # Bounded for memory budget
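Inside an asyncio-based handler, the same bound can be enforced in code with a semaphore. This is a minimal sketch using only the standard library; `run_bounded` is a hypothetical helper, not an OpenClaw API, and the configured concurrency value would be passed in as `max_concurrency`.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(task_inputs: Iterable[T],
                      handler: Callable[[T], Awaitable[R]],
                      max_concurrency: int) -> List[R]:
    """Run handler over task_inputs with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> R:
        # Tasks beyond the limit wait here before doing any real work.
        async with semaphore:
            return await handler(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in task_inputs))
```

Note that `gather` still creates one coroutine object per input up front; the semaphore limits how many of them execute the handler at once, which is where the per-task memory is allocated.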

Pitfall 3: Not Setting Up Graceful Shutdown

Abrupt termination leaves tasks in an inconsistent state:

# ❌ Wrong: No graceful shutdown handling
lifecycle:
  preStop:
    exec:
      command: ["kill", "-15", "1"]  # Sends SIGTERM at once, without draining in-flight tasks

# ✅ Correct: Graceful drain period
lifecycle:
  preStop:
    exec:
      command: ["/clw/bin/drain", "--timeout=60s"]
terminationGracePeriodSeconds: 90

Pitfall 4: Missing Memory Monitoring

Without monitoring, the first sign of memory trouble is the OOM kill itself:

# ❌ Wrong: No monitoring
def run_worker():
    worker = Worker()
    worker.start()

# ✅ Correct: Proactive monitoring
def run_worker():
    worker = Worker()
    worker.add_health_check(
        check_fn=check_memory_usage,
        interval="10s",
        threshold_pct=80
    )
    worker.start()

Pitfall 5: Ignoring Native Memory

Java and Go workers use native (off-heap) memory in addition to the managed heap:

# ❌ Wrong: Only monitoring heap
jstat -gc $(pgrep clw-worker) | head -5

# ✅ Correct: Monitoring all memory
grep -i vmrss /proc/$(pgrep -o clw-worker)/status
# VmRSS:   4194304 kB (resident set size, includes native memory)

Pitfall 6: Not Handling Memory Pressure in Middleware

Accepting new requests while the worker is already close to its limit pushes it over the edge:

# ❌ Wrong: No memory awareness
@app.middleware
async def process_task(request, call_next):
    return await call_next(request)

# ✅ Correct: Memory-aware middleware
@app.middleware
async def process_task(request, call_next):
    if is_memory_pressure():
        return Response(
            status=503,
            body="Server overloaded, retry later"
        )
    return await call_next(request)
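The middleware above calls `is_memory_pressure()` without defining it. One possible implementation for a Linux container on cgroup v2 reads the `memory.current` and `memory.max` files. This is a sketch under that assumption: the 0.85 threshold is illustrative, and the paths are parameters so the function can be pointed elsewhere on hosts with a different cgroup layout.

```python
from pathlib import Path

# Standard cgroup v2 locations as seen from inside a container.
CGROUP_CURRENT = Path("/sys/fs/cgroup/memory.current")
CGROUP_MAX = Path("/sys/fs/cgroup/memory.max")


def is_memory_pressure(threshold: float = 0.85,
                       current_path: Path = CGROUP_CURRENT,
                       max_path: Path = CGROUP_MAX) -> bool:
    """Return True when cgroup memory usage is at or above threshold."""
    try:
        current = int(current_path.read_text().strip())
        raw_max = max_path.read_text().strip()
        if raw_max == "max":
            return False  # no memory limit is configured
        limit = int(raw_max)
    except (OSError, ValueError):
        # Files missing or unreadable (e.g. not on cgroup v2): do not
        # reject traffic based on a measurement we could not take.
        return False
    return limit > 0 and current / limit >= threshold
```

Failing open (returning False) when the files cannot be read is a design choice: it avoids rejecting all traffic on hosts where the measurement is unavailable, at the cost of providing no protection there.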

6. Related Errors

The following errors are commonly related to clw-worker-oom:

clw-container-limit

Triggered when a container approaches one of its resource limits (CPU, memory, or storage). Because it fires before the limit is actually reached, it often precedes clw-worker-oom.

[clw-container-limit] Container approaching memory limit
Memory: 1.9GiB / 2GiB (95%)

clw-task-timeout

Tasks that consume excessive memory often run slowly and hit timeout limits. Memory leaks can cause both OOM and timeout errors:

[clw-task-timeout] Task task-1234 exceeded 120s timeout
[clw] Processing time increased 10x due to memory pressure

clw-dispatch-fail

When workers are repeatedly OOM-killed and restarting, the dispatcher has no healthy worker to route tasks to:

[clw-dispatch-fail] No healthy workers available
[clw] All workers in degraded state (restarting from OOM)

clw-gc-pause

Excessive garbage collection pauses (especially in JVM-based workers) indicate memory pressure and often precede OOM:

[clw-gc-pause] GC pause: 2.5s (threshold: 500ms)
[clw] Memory usage: 92% before GC
[clw] Recommendation: Increase memory or fix leaks

clw-swap-thrash

When memory is exhausted and the system starts swapping, performance degrades rapidly:

[clw] Swap usage: 2.3 GiB
[clw] Swap-in rate: 500MB/s
[clw] System under memory pressure

clw-connection-drop

Database or API connections may drop when workers are under memory pressure and cannot respond to keepalives:

[clw-connection-drop] PostgreSQL connection closed
[clw] Worker at 98% memory when connection terminated

Quick Reference:

| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Memory Utilization | < 70% | 70-85% | > 85% |
| OOM Restarts (24h) | 0 | 1-2 | > 2 |
| GC Pause | < 200ms | 200-500ms | > 500ms |
| Swap Usage | 0 | < 1GB | > 1GB |

Monitor these metrics to prevent clw-worker-oom from recurring.
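The quick-reference thresholds can be encoded directly so that an alerting script applies them consistently. The sketch below covers the memory, OOM restart, and GC pause rows; swap is left out because its healthy band is exactly zero and does not fit the same two-threshold shape. The metric keys are hypothetical names chosen for this example.

```python
# (healthy_below, critical_above) per metric, copied from the quick reference.
THRESHOLDS = {
    "memory_utilization_pct": (70, 85),
    "oom_restarts_24h": (1, 2),
    "gc_pause_ms": (200, 500),
}


def classify(metric: str, value: float) -> str:
    """Map a metric reading to "healthy", "warning", or "critical"."""
    healthy_below, critical_above = THRESHOLDS[metric]
    if value < healthy_below:
        return "healthy"
    if value > critical_above:
        return "critical"
    return "warning"  # everything between the two bounds, inclusive


print(classify("memory_utilization_pct", 51.27))
print(classify("gc_pause_ms", 2500))
```

The first call uses the utilization from the verification example and lands in the healthy band; the second uses the 2.5 s pause from the clw-gc-pause log sample and is classified as critical.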