1. Symptoms
The clw-worker-oom error manifests through several observable symptoms that can help you quickly identify a memory exhaustion issue in your OpenClaw worker processes.
Primary Symptoms:
[ERROR] Worker process terminated unexpectedly. Exit code: 137
[clw-worker-oom] Worker worker-02 exceeded memory limit: 2147483648 bytes
[clw] Memory utilization: 100.0% of 2048MB limit
[clw] Active tasks: 12 | Completed: 847 | Failed: 3
**Process-Level Indicators:**
```bash
# Check for OOM killer invocations (match both common kernel message forms)
dmesg | grep -iE "out of memory|oom-killer"
# Output: [12345.678] Worker invoked oom-killer: pid 12345, uid 1000

# Check worker process memory usage
# (the [c] bracket keeps the grep process itself out of the results)
ps aux | grep '[c]lw-worker'
# Output (columns abbreviated):
# USER PID %CPU %MEM VSZ RSS COMMAND
# clw 1234 45.2 95.3 3145728 2097152 clw-worker --id=worker-02
```
Application-Level Symptoms:
- Tasks queue up without being processed
- Worker process exits with code 137 (128 + 9 = SIGKILL)
- Error logs contain the `[clw-worker-oom]` prefix
- Cluster reports `worker_health: degraded`
- Response latency spikes for dispatched tasks
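
The 128 + 9 arithmetic in the list above follows the POSIX shell convention of reporting death by signal as 128 plus the signal number. A minimal sketch of that decoding, using only the standard library; the helper name `describe_exit_code` is illustrative and not part of OpenClaw:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit status the way a POSIX shell reports it."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

On Linux, exit code 137 decodes to SIGKILL, which is the signal the kernel OOM killer sends.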
Observable Metrics:
{
  "worker_id": "worker-02",
  "error_code": "clw-worker-oom",
  "memory_limit_mb": 2048,
  "memory_peak_mb": 2100,
  "memory_utilization_pct": 102.5,
  "timestamp": "2025-01-20T14:32:15Z"
}
2. Root Cause
The clw-worker-oom error occurs when a worker process consumes memory beyond its configured limit. Understanding the root causes helps prevent recurrence.
Primary Causes:
- Unbounded Memory Growth in Task Handlers
Workers process tasks in a loop. If task handlers accumulate state without cleanup, memory grows until the limit is hit:
# Problematic: accumulating results without memory management
class TaskHandler:
    def __init__(self):
        self.results = []  # Grows indefinitely

    def process(self, task):
        result = self.execute(task)
        self.results.append(result)  # Memory leak
        return result
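
One way to confirm this kind of growth is Python's standard `tracemalloc` module, which can compare two heap snapshots and rank source lines by how much memory they added. The sketch below is a hedged illustration: `top_growth` and `leaky_batch` are hypothetical names, and the module-level `results` list stands in for the handler's `self.results`.

```python
import tracemalloc

results = []  # stands in for the handler's unbounded self.results


def leaky_batch():
    """Simulate a batch of tasks whose results are never released."""
    for i in range(10_000):
        results.append("result-%d" % i)


def top_growth(work, limit=3):
    """Run work() and return the source lines whose allocations changed most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    work()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to() returns StatisticDiff entries, largest change first
    return after.compare_to(before, "lineno")[:limit]


for stat in top_growth(leaky_batch):
    print(stat)
```

A line that keeps appearing at the top across repeated batches is a likely leak site.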
- Large Task Payloads
Processing extremely large payloads without streaming or chunking can exhaust memory:
# Problematic: loading entire payload into memory
def handle_task(task):
    # This loads a potentially massive payload all at once
    data = task.get_full_payload()  # Could be 5GB
    return process_large_file(data)
- Memory Leaks in Native Dependencies
Workers often depend on native libraries (database clients, HTTP libraries) that may leak memory under certain conditions:
[clw] WARNING: Native memory leak detected in libclient.so
[clw] Leaked: 15MB per 1000 requests
[clw] Estimated time to OOM: 4 hours
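
An estimate like the one in this log is simple arithmetic: remaining headroom divided by the leak rate. The sketch below shows the calculation under assumed inputs. The headroom of 1080 MB and the rate of 5 requests per second are illustrative values chosen so the result matches the 4-hour figure; they do not come from the log.

```python
def hours_until_oom(headroom_mb, leak_mb_per_1000_requests, requests_per_second):
    """Estimate hours until a steady leak consumes the remaining headroom."""
    leak_mb_per_second = leak_mb_per_1000_requests / 1000 * requests_per_second
    if leak_mb_per_second <= 0:
        return float("inf")  # no leak, or no traffic
    return headroom_mb / leak_mb_per_second / 3600


# Assumed: 1080 MB of headroom left, 15 MB leaked per 1000 requests, 5 req/s
print(f"{hours_until_oom(1080, 15, 5):.1f} hours")
```

The estimate assumes a constant request rate and a linear leak, so treat it as a rough planning number.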
- Incorrect Memory Configuration
Setting memory limits too low for the workload is a common configuration error:
# Example: undersized memory limit
worker:
  resources:
    memory: "512Mi"  # Too small for expected workload
    cpu: "1000m"
- Concurrent Task Overload
Running too many concurrent tasks without memory headroom:
# Problematic: unbounded concurrency
@clw.task_handler
async def process_task(task):
    # Each task allocates 200MB
    # Running 20 concurrently = 4GB needed
    return heavy_operation(task)
# Without limiting concurrency, memory explodes
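
In plain asyncio code, a semaphore is one common way to cap how many tasks are in flight at once. This is a generic sketch, not the OpenClaw API: `run_bounded` and `MAX_CONCURRENT_TASKS` are hypothetical names, and the limit of 5 assumes roughly 1 GB of headroom at 200MB per task.

```python
import asyncio

# Assumed budget: about 1 GB of headroom / 200 MB per task
MAX_CONCURRENT_TASKS = 5


async def run_bounded(tasks, handler, limit=MAX_CONCURRENT_TASKS):
    """Run handler(task) for every task with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(task):
        # Waits here whenever `limit` handlers are already running
        async with semaphore:
            return await handler(task)

    # gather() preserves input order in its result list
    return await asyncio.gather(*(guarded(t) for t in tasks))
```

Usage would look like `asyncio.run(run_bounded(tasks, handler))`. All wrapper coroutines are created up front, but the memory-heavy handler body only runs once a slot is free.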
- Garbage Collection Failures
In garbage-collected runtimes such as Go or Java, a collector that stalls or cannot keep up with the allocation rate lets the heap grow until the worker exceeds its memory limit.
3. Step-by-Step Fix
Here’s a systematic approach to fix clw-worker-oom errors:
Step 1: Identify Current Memory Configuration
Before:
# Check current worker memory limits
clw config show --format=yaml
Output:
worker:
  memory_limit: "2GiB"
  task_concurrency: 50
  max_payload_size: "100MB"
After:
worker:
  memory_limit: "4GiB"
  task_concurrency: 20
  max_payload_size: "50MB"
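
Memory limit and concurrency have to be sized together. A rough way to pick a concurrency value is to divide the usable memory by the per-task footprint. The sketch below assumes a 300 MB baseline for the idle worker, 150 MB per task, and 20% headroom; all three are illustrative numbers you would replace with measurements from your own workload.

```python
def max_safe_concurrency(limit_mb, baseline_mb, per_task_mb, headroom_fraction=0.2):
    """Return how many tasks fit in the limit after baseline and headroom."""
    usable_mb = limit_mb * (1 - headroom_fraction) - baseline_mb
    return max(0, int(usable_mb // per_task_mb))


# Assumed footprint: 300 MB baseline, 150 MB per task
print(max_safe_concurrency(2048, 300, 150))  # 2 GiB limit
print(max_safe_concurrency(4096, 300, 150))  # 4 GiB limit
```

Under these assumptions a 2 GiB worker fits far fewer than 50 concurrent tasks, which is why the fix changes both values.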
Step 2: Review Task Handler Code for Memory Leaks
Before:
import asyncio
from typing import List

class HeavyTaskHandler:
    def __init__(self):
        self.cache: List[dict] = []

    async def process(self, task):
        result = await self.compute(task)
        # Memory leak: cache grows forever
        self.cache.append(result)
        # Also storing full payload
        task_data = await task.full_data()  # Keeps in memory
        return result
After:
from collections import OrderedDict

class HeavyTaskHandler:
    def __init__(self, max_cache_size: int = 100):
        self.cache: OrderedDict = OrderedDict()
        self.max_cache_size = max_cache_size

    async def process(self, task):
        result = await self.compute(task)
        # Bounded cache: the least recently written entry is evicted first
        self.cache[task.id] = result
        self.cache.move_to_end(task.id)  # a re-processed task counts as fresh
        if len(self.cache) > self.max_cache_size:
            self.cache.popitem(last=False)
        return result

    def clear_cache(self):
        self.cache.clear()
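
The handler above only writes to its cache. If cached results are also read back, each read has to refresh the entry's position, otherwise eviction follows insertion order rather than recency of use. A minimal standalone sketch of that read path, with `LRUCache` as a hypothetical helper class:

```python
from collections import OrderedDict


class LRUCache:
    """Small bounded cache that evicts the least recently used entry."""

    def __init__(self, max_size=100):
        self._data = OrderedDict()
        self.max_size = max_size

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # drop the least recently used

    def __len__(self):
        return len(self._data)
```

For pure functions, the standard library's `functools.lru_cache(maxsize=...)` gives the same bound with less code.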
Step 3: Implement Streaming for Large Payloads
Before:
async def handle_large_file(task):
    # Loads entire file into memory
    data = await task.download_payload()
    result = process_in_memory(data)
    return result
After:
CHUNK_THRESHOLD = 64 * 1024 * 1024  # example value: compact results every 64 MiB

async def handle_large_file(task):
    # Stream-based processing
    async with task.payload_stream() as stream:
        result = await process_streamed(stream)
    return result

async def process_streamed(stream):
    accumulated_size = 0
    chunk_results = []
    async for chunk in stream:
        chunk_results.append(process_chunk(chunk))
        accumulated_size += len(chunk)
        # Explicit memory management: compact intermediate results in batches
        if accumulated_size > CHUNK_THRESHOLD:
            # aggregate_results is expected to return a shorter list
            chunk_results = aggregate_results(chunk_results)
            accumulated_size = 0  # start counting the next batch
    return final_result(chunk_results)
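
To make the streaming idea concrete, here is a self-contained sketch that hashes a payload chunk by chunk, so only one chunk is resident at a time. `fake_payload_stream` is a stand-in for whatever stream object the worker API actually provides, and `digest_stream` is a hypothetical helper name.

```python
import asyncio
import hashlib


async def fake_payload_stream(total_bytes, chunk_size=64 * 1024):
    """Stand-in for a payload stream: yields zero-filled chunks."""
    remaining = total_bytes
    while remaining > 0:
        size = min(chunk_size, remaining)
        yield bytes(size)
        remaining -= size
        await asyncio.sleep(0)  # give other tasks a chance to run


async def digest_stream(stream):
    """Hash a stream incrementally and return (total_bytes, hex_digest)."""
    hasher = hashlib.sha256()
    total = 0
    async for chunk in stream:
        hasher.update(chunk)  # the chunk can be freed after this call
        total += len(chunk)
    return total, hasher.hexdigest()


total, digest = asyncio.run(digest_stream(fake_payload_stream(1_000_000)))
print(total, digest[:12])
```

Peak memory here is bounded by the chunk size rather than the payload size, which is the property the fix relies on.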
Step 4: Configure Resource Limits Properly
Before:
# clw-worker.yaml - undersized configuration
apiVersion: clw.io/v1
kind: WorkerConfig
spec:
  resources:
    limits:
      memory: "1Gi"
    requests:
      memory: "1Gi"
  concurrency: 100
  taskTimeout: "30s"
After:
# clw-worker.yaml - properly sized configuration
apiVersion: clw.io/v1
kind: WorkerConfig
spec:
  resources:
    limits:
      memory: "4Gi"
      ephemeral-storage: "2Gi"
    requests:
      memory: "2Gi"
  concurrency: 25
  taskTimeout: "60s"
  memoryReservation: "2Gi"
  oomScoreAdj: 100
Step 5: Add Memory Monitoring and Auto-Restart
Before:
# Minimal worker setup
def main():
    worker = create_worker()
    worker.run()
After:
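import logging
import resource
import signal
import sys

class MonitoredWorker:
    def __init__(self, worker, memory_threshold_mb=3072, check_interval_s=30):
        self.logger = logging.getLogger(__name__)
        self.worker = worker  # the object returned by create_worker()
        self.memory_threshold_mb = memory_threshold_mb  # example: 75% of a 4 GiB limit
        self.check_interval_s = check_interval_s
        self.restart_count = 0
        self.max_restarts = 5

    def setup_memory_monitoring(self):
        # signal.alarm() fires only once, so use a repeating interval timer
        signal.signal(signal.SIGALRM, self.check_memory)
        signal.setitimer(signal.ITIMER_REAL, self.check_interval_s, self.check_interval_s)

    def check_memory(self, signum, frame):
        # ru_maxrss is the peak resident set size: KiB on Linux, bytes on macOS
        usage = resource.getrusage(resource.RUSAGE_SELF)
        memory_mb = usage.ru_maxrss / 1024
        self.logger.info(f"Peak memory usage: {memory_mb:.2f} MB")
        if memory_mb > self.memory_threshold_mb:
            self.logger.warning(
                f"Memory threshold exceeded: {memory_mb:.2f}MB > {self.memory_threshold_mb}MB"
            )
            self.trigger_graceful_shutdown()

    def trigger_graceful_shutdown(self):
        self.logger.info("Initiating graceful shutdown due to memory pressure")
        self.worker.task_queue.drain()
        sys.exit(0)

    def run(self):
        self.worker.run()

def main():
    worker = MonitoredWorker(create_worker())
    worker.setup_memory_monitoring()
    worker.run()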
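
Note that `ru_maxrss` reports the peak resident set size, so it never drops after a spike. If you want the current value instead, Linux exposes it as `VmRSS` in `/proc/<pid>/status`. A minimal Linux-only sketch; the helper names `parse_vmrss_kb` and `current_rss_mb` are illustrative:

```python
def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (in kB) from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:    123456 kB"
            return int(line.split()[1])
    return None  # kernel threads, for example, have no VmRSS line


def current_rss_mb(pid="self"):
    """Return the current resident set size in MB, or None if unavailable."""
    with open(f"/proc/{pid}/status") as f:
        kb = parse_vmrss_kb(f.read())
    return None if kb is None else kb / 1024
```

Swapping this into the periodic check lets the worker recover from a transient spike instead of shutting down on a stale peak.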
Step 6: Configure Kubernetes OOM Handling
# worker-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: clw-worker
  labels:
    app: clw-worker
spec:
  containers:
    - name: worker
      image: openclaw/worker:2.1.0
      resources:
        limits:
          memory: "4Gi"
        requests:
          memory: "2Gi"
      env:
        - name: CLW_MEMORY_LIMIT
          valueFrom:
            resourceFieldRef:
              resource: limits.memory
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "clw-worker drain --timeout=30s"]
  restartPolicy: Always
  terminationGracePeriodSeconds: 60
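
With the default divisor, Kubernetes exposes `limits.memory` through `resourceFieldRef` as a plain byte count, so a 4Gi limit arrives as `4294967296`. The sketch below shows how application code might turn that variable into a soft threshold. Whether the OpenClaw worker reads `CLW_MEMORY_LIMIT` itself is not confirmed here; the helper name, the 80% fraction, and the 2 GiB fallback are assumptions.

```python
import os


def memory_threshold_mb(env=os.environ, fraction=0.8, default_limit_bytes=2 * 1024**3):
    """Derive a soft memory threshold in MB from CLW_MEMORY_LIMIT (bytes)."""
    raw = env.get("CLW_MEMORY_LIMIT", "")
    # Fall back to the default when the variable is unset or malformed
    limit_bytes = int(raw) if raw.isdigit() else default_limit_bytes
    return limit_bytes * fraction / (1024 * 1024)


print(f"{memory_threshold_mb({'CLW_MEMORY_LIMIT': '4294967296'}):.1f} MB")
```

Deriving the threshold from the injected limit keeps the in-process check in step with the pod spec when the limit changes.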
4. Verification
After implementing fixes, verify that the clw-worker-oom error is resolved:
Immediate Verification:
# Check worker health status
clw worker status --watch
# Output should show:
# WORKER ID STATUS MEMORY USAGE TASKS
# worker-01 healthy 1.8GiB / 4GiB 12/1000
# worker-02 healthy 2.1GiB / 4GiB 8/500
# Check for OOM events in logs
clw logs --since=1h --filter=oom
# No recent OOM events found
Load Testing Verification:
# Run memory stress test
clw benchmark --workload=memory-intensive --duration=10m --concurrency=50
# Expected output:
# Benchmark completed
# Total tasks: 5000
# Failed: 0 (0.00%)
# Peak memory: 3.2GiB / 4GiB (80%)
# No OOM events detected
Prometheus Metrics Verification:
# Query memory metrics
clw_metrics_memory_usage_max{worker_id="worker-02"}
# Result: 2100000000 (2.1GB, well under 4GB limit)
increase(clw_worker_restart_total{reason="oom"}[24h])
# Result: 0 (no OOM restarts in last 24h)
Application-Level Verification:
# Check worker memory metrics
import clw
stats = clw.get_worker_stats()
print(f"Memory Usage: {stats.memory_used_mb}MB / {stats.memory_limit_mb}MB")
print(f"Utilization: {stats.memory_pct}%")
print(f"OOM Count (24h): {stats.oom_count}")
# Expected output:
# Memory Usage: 2100MB / 4096MB
# Utilization: 51.27%
# OOM Count (24h): 0
5. Common Pitfalls
Avoid these common mistakes when fixing clw-worker-oom errors:
Pitfall 1: Increasing Memory Without Fixing Root Cause
Simply raising the memory limit masks the problem rather than fixing it:
# ❌ Wrong: Just throw more memory at it
resources:
  limits:
    memory: "16Gi"  # This avoids fixing the actual leak
# ✅ Correct: Fix the leak AND right-size the limit
resources:
  limits:
    memory: "4Gi"  # Properly sized for fixed workload
Pitfall 2: Forgetting to Limit Concurrency
Raising memory but leaving concurrency unlimited still causes OOM:
# ❌ Wrong: Unbounded concurrency
spec:
  memory: "8Gi"
  concurrency: 500  # Will still OOM
# ✅ Correct: Both limits properly configured
spec:
  memory: "4Gi"
  concurrency: 25  # Bounded for memory budget
Pitfall 3: Not Setting Up Graceful Shutdown
Abrupt termination leaves tasks in an inconsistent state:
# ❌ Wrong: No graceful shutdown handling
lifecycle:
  preStop:
    exec:
      command: ["kill", "-15", "1"]  # Signals PID 1 at once; in-flight tasks are not drained
# ✅ Correct: Graceful drain period
lifecycle:
  preStop:
    exec:
      command: ["/clw/bin/drain", "--timeout=60s"]
terminationGracePeriodSeconds: 90
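
On the worker side, graceful shutdown usually means catching SIGTERM, letting the in-flight task finish, and then refusing new work. The sketch below shows that pattern in plain Python. It is not the OpenClaw drain implementation, and every name in it is hypothetical.

```python
import signal
import threading

shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: only set a flag, so the current task can finish."""
    shutdown_requested.set()


def install_handlers():
    # Must be called from the main thread
    signal.signal(signal.SIGTERM, request_shutdown)


def run_until_shutdown(tasks, handle):
    """Process tasks in order, stopping before the next one once flagged."""
    done = []
    for task in tasks:
        if shutdown_requested.is_set():
            break
        done.append(handle(task))
    return done
```

The grace period in the pod spec needs to be longer than the slowest task, or the kubelet will send SIGKILL before the loop exits.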
Pitfall 4: Missing Memory Monitoring
Not monitoring memory until OOM occurs:
# ❌ Wrong: No monitoring
def run_worker():
    worker = Worker()
    worker.start()

# ✅ Correct: Proactive monitoring
def run_worker():
    worker = Worker()
    worker.add_health_check(
        check_fn=check_memory_usage,
        interval="10s",
        threshold_pct=80
    )
    worker.start()
Pitfall 5: Ignoring Native Memory
Java and Go workers use native memory beyond the managed heap, so heap-only tools under-report usage:
# ❌ Wrong: Only monitoring heap (jstat applies to JVM workers only)
jstat -gc $(pgrep -o clw-worker) | head -5
# ✅ Correct: Monitoring all resident memory
grep VmRSS /proc/$(pgrep -o clw-worker)/status
# VmRSS:  4194304 kB (reported in kB; includes native memory)
Pitfall 6: Not Handling Memory Pressure in Middleware
# ❌ Wrong: No memory awareness
@app.middleware
async def process_task(request, call_next):
    return await call_next(request)

# ✅ Correct: Memory-aware middleware
@app.middleware
async def process_task(request, call_next):
    if is_memory_pressure():
        return Response(
            status=503,
            body="Server overloaded, retry later"
        )
    return await call_next(request)
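
The middleware above leaves `is_memory_pressure()` undefined. One possible implementation, assuming the worker runs in a container on a cgroup v2 host, compares `memory.current` against `memory.max`. The 85% threshold and the fail-open behaviour are design assumptions, not OpenClaw defaults.

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # cgroup v2 unified hierarchy


def memory_fraction(current_text, max_text):
    """Return used/limit from the contents of memory.current and memory.max."""
    limit = max_text.strip()
    if limit == "max":  # no memory limit configured for this cgroup
        return None
    return int(current_text) / int(limit)


def is_memory_pressure(threshold=0.85, root=CGROUP_ROOT):
    """Return True when cgroup memory usage is at or above the threshold."""
    try:
        fraction = memory_fraction(
            (root / "memory.current").read_text(),
            (root / "memory.max").read_text(),
        )
    except (OSError, ValueError):
        return False  # cannot tell; do not shed load on a guess
    return fraction is not None and fraction >= threshold
```

Reading two small files per request is cheap, but a busy worker may still want to cache the result for a second or so.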
6. Related Errors
The following errors are commonly related to clw-worker-oom:
clw-container-limit
Triggered when container-level resource limits (CPU, memory, storage) are approached. Because it fires as usage nears the limit, it often appears shortly before clw-worker-oom.
[clw-container-limit] Container approaching memory limit
Memory: 1.9GiB / 2GiB (95%)
clw-task-timeout
Tasks that consume excessive memory often run slowly and hit timeout limits. Memory leaks can cause both OOM and timeout errors:
[clw-task-timeout] Task task-1234 exceeded 120s timeout
[clw] Processing time increased 10x due to memory pressure
clw-dispatch-fail
When workers are OOM and restart continuously, the dispatcher cannot route tasks:
[clw-dispatch-fail] No healthy workers available
[clw] All workers in degraded state (restarting from OOM)
clw-gc-pause
Excessive garbage collection pauses (especially in JVM-based workers) indicate memory pressure and often precede OOM:
[clw-gc-pause] GC pause: 2.5s (threshold: 500ms)
[clw] Memory usage: 92% before GC
[clw] Recommendation: Increase memory or fix leaks
clw-swap-thrash
When memory is exhausted and the system starts swapping, performance degrades rapidly:
[clw] Swap usage: 2.3 GiB
[clw] Swap-in rate: 500MB/s
[clw] System under memory pressure
clw-connection-drop
Database or API connections may drop when workers are under memory pressure and cannot respond to keepalives:
[clw-connection-drop] PostgreSQL connection closed
[clw] Worker at 98% memory when connection terminated
Quick Reference:
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Memory Utilization | < 70% | 70-85% | > 85% |
| OOM Restarts (24h) | 0 | 1-2 | > 2 |
| GC Pause | < 200ms | 200-500ms | > 500ms |
| Swap Usage | 0 | < 1GB | > 1GB |
Monitor these metrics to prevent clw-worker-oom from recurring.
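
The thresholds in the table can be encoded directly in an alerting helper. The sketch below covers the first three rows; swap usage is left out because its healthy band is exactly zero rather than a range. The metric keys are hypothetical names, not OpenClaw metric identifiers.

```python
# (warning_from, critical_above) pairs taken from the quick reference table
THRESHOLDS = {
    "memory_utilization_pct": (70, 85),
    "oom_restarts_24h": (1, 2),
    "gc_pause_ms": (200, 500),
}


def classify(metric, value):
    """Map a metric reading to 'healthy', 'warning', or 'critical'."""
    warning_from, critical_above = THRESHOLDS[metric]
    if value > critical_above:
        return "critical"
    if value >= warning_from:
        return "warning"
    return "healthy"


print(classify("memory_utilization_pct", 51.27))
print(classify("gc_pause_ms", 2500))
```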