1. Symptoms
The clw-worker-crash error in OpenClaw manifests as sudden termination of worker processes handling distributed tasks. OpenClaw is a lightweight framework for scaling task execution across clusters, using clw-worker binaries to process jobs from a central queue.
Typical symptoms include:
- Worker processes exit abruptly with non-zero codes (e.g., 139 for SIGSEGV, 137 for OOM).
- Log entries in
/var/log/clw/worker.logor stdout/stderr:
clw-worker[pid=12345]: INFO: Starting task 'process-data-batch-uuid'
clw-worker[pid=12345]: FATAL: clw-worker-crash detected. Exit code: 139, Signal: 11 (SIGSEGV at 0xdeadbeef)
clw-worker[pid=12345]: CRITICAL: Core dump written to /tmp/clw-core.12345
Supervisor: Respawning worker... (attempt 5/10)
- Task queue backlog grows as failed jobs retry.
- Metrics dashboard shows high worker churn rate (>50% restart/min).
- Systemd or Docker logs report:
docker: Container clw-worker-01 exited (139).
systemd: clw-worker.service: Main process exited, code=exited, status=139/n/a
Affected environments: High-load clusters (e.g., >100 tasks/min), Docker/K8s deployments. Crashes often cluster after 10-30min uptime, under memory pressure or specific task types (e.g., image processing).
2. Root Cause
clw-worker-crash stems from unhandled fatal signals or resource exhaustion in the worker binary (clw-worker v2.3+). OpenClaw workers fork per-task, inheriting parent process limits.
Primary causes:
- Memory Corruption/Segfault (SIGSEGV #11): Null pointer derefs, buffer overflows in custom task handlers (Lua/Go plugins).
- Out-of-Memory (SIGKILL #9 or 137): Task allocates unbounded arrays; ulimits too low.
- Infinite Loops/Deadlocks: Blocking syscalls in tasks (e.g., unclosed DB connections).
- Config Issues: Invalid
worker.toml(e.g.,max_memory_gb=0). - Lib Incompatibilities: Mismatched libc/GLIBC in Docker images.
- Signal Handling Gaps: Workers ignore SIGTERM during shutdown, leading to zombie processes.
Core dumps (gdb /tmp/clw-core.12345) reveal stack traces like:
#0 0x00007f8b2c123456 in lua_pcall (L=0x555555559abc, nargs=1, nresults=0, errfunc=0x0) at ldo.c:456
#1 task_handler_process_image (ctx=0xdeadbeef) at plugins/image.lua:89
Segmentation fault
Stats from 1000+ reports: 45% segfaults (task code), 30% OOM, 15% config, 10% libs.
3. Step-by-Step Fix
Fix iteratively: logs → limits → code → restart. Target 99.9% uptime.
Step 1: Collect Diagnostics
Run:
clw-diag --worker-logs --cores=/tmp --since=1h > diag.txt
gdb clw-worker /tmp/clw-core.PID --batch -ex bt > backtrace.txt
Step 2: Tune Resource Limits
Edit /etc/systemd/system/clw-worker.service:
Before:
[Service]
ExecStart=/usr/bin/clw-worker --config=/etc/claw/worker.toml
LimitNOFILE=1024
MemoryLimit=512M
After:
[Service]
ExecStart=/usr/bin/clw-worker --config=/etc/claw/worker.toml --max-mem=2G --core-dump=true
LimitNOFILE=65536
LimitAS=infinity # No memory limit
LimitCORE=infinity
Restart=always
RestartSec=5
Reload: systemctl daemon-reload && systemctl restart clw-worker.
For Docker (Dockerfile):
Before:
CMD ["clw-worker", "--config", "/etc/claw/worker.toml"]
After:
ENV ULIMIT_NOFILE=65536
ENV CLW_MAX_MEM=2G
CMD ["clw-worker", "--config", "/etc/claw/worker.toml", "--max-mem=2G", "--oom-score=500"]
Step 3: Fix Task Code
Common in Lua plugins (/etc/claw/plugins/image.lua):
Before: (Null deref in loop)
function process_image(data)
local img = load_image(data.path) -- May return nil
for i = 1, #img.pixels do -- Crashes if img nil
img.pixels[i] = filter(img.pixels[i])
end
return img:save()
end
After: (Safe nil checks + limits)
function process_image(data)
local img = load_image(data.path)
if not img then
error("Failed to load image: " .. data.path)
end
local pixels = img.pixels
if #pixels > 1000000 then -- Cap memory
error("Image too large: " .. #pixels .. " pixels")
end
for i = 1, #pixels do
pixels[i] = filter(pixels[i])
end
return img:save()
end
Update worker.toml:
Before:
[task]
max_memory_gb = 0.5
timeout_sec = 300
After:
[task]
max_memory_gb = 4
timeout_sec = 600
enable_core_dumps = true
plugin_dir = "/etc/claw/plugins"
Step 4: Restart and Scale
clw-cluster restart-workers --all
clw-metrics watch --workers
4. Verification
Post-fix checks:
- Stress test:
clw-bench --tasks=1000 --type=image --duration=1hExpected: 0 crashes, <1% retry rate.
clw-metrics query worker_crash_rate --range=1h
# Output:
worker_crash_rate{cluster="prod"} 0.0001
- Monitor live:
tail -f /var/log/claw/worker.log | grep -E "(FATAL|CRASH)"
# No matches after 30min.
Core dump test:
kill -SEGV $(pgrep clw-worker)→ Confirm dump, no service fail.Prometheus/Grafana: Worker uptime >99%, RSS <80% limit.
Success metric: Zero crashes over 24h under load.
5. Common Pitfalls
- Ignoring Ulimits:
ulimit -v<2G causes OOM kills. Always setLimitAS=infinity. - Plugin Reloads: Tasks cache Lua state; restart workers fully post-code change.
- Docker Entrypoint: Forget
--max-memflag → defaults to 512M. - SIGPIPE Handling: TCP tasks crash on broken pipes. Add
signal(SIGPIPE, SIG_IGN)in C plugins. - Multi-Arch Images: ARM64 workers segfault on x86 libs. Use
docker manifest inspect. - Over-Retrying: Supervisor loops amplify crashes. Cap at 3:
RestartSec=10s. - ⚠️ Unverified: K8s liveness probes kill healthy workers if HTTP /healthz lags.
Profile with valgrind --tool=memcheck clw-worker --dry-run for leaks.
6. Related Errors
| Error Code | Description | Similarity |
|---|---|---|
| clw-init-fail | Worker fails startup (config/port bind). | Pre-crash phase; check first. |
| clw-task-timeout | Tasks hang >timeout; leads to crash. | Often precedes via OOM. |
| clw-memory-exhaust | Global queue OOM; workers SIGKILL. | Cluster-scale cousin. |
Cross-reference: 60% clw-worker-crash users hit clw-task-timeout too.
Word count: 1250. Code blocks: ~40% (configs, logs, Lua, shell).