1. Symptoms
When a worker node becomes unreachable in an OpenClaw cluster, you will observe several distinct symptoms that indicate the communication breakdown between the worker and the cluster manager.
Primary Indicators
The most common manifestation is the worker disappearing from the cluster dashboard or appearing with a status of UNREACHABLE. When you query the cluster with the clw cluster status command, the affected worker shows as unavailable:
$ clw cluster status
CLUSTER: production-cluster
├── Manager: healthy (leader)
├── Workers: 3 total, 1 unreachable
│ ├── worker-01 healthy 192.168.1.101:9090
│ ├── worker-02 UNREACHABLE 192.168.1.102:9090
│ └── worker-03 healthy 192.168.1.103:9090
└── Queue Depth: 247 pending tasks
Log Error Messages
The worker logs will typically contain one or more of the following error messages:
[ERROR] [2024-01-15T10:23:45.123Z] Worker connection lost: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.10:9090: connect: connection refused"
[ERROR] [2024-01-15T10:23:45.456Z] Failed to send heartbeat to manager after 3 attempts
[WARN] [2024-01-15T10:23:46.789Z] Retrying connection in 5 seconds...
Task Processing Impact
Tasks assigned to the unreachable worker remain in a PENDING or ASSIGNED state indefinitely. Because the cluster manager receives no acknowledgments, tasks may be duplicated once the task timeout is exceeded:
$ clw task list --status PENDING
TASK ID      TYPE           WORKER      AGE   RETRY
task-7a8b2c  data-process   worker-02   15m   0/3
task-8c9d3e  image-resize   worker-02   15m   0/3
task-9f0e4a  video-encode   worker-02   15m   0/3
Network-Level Symptoms
From a network perspective, you may observe the following when performing diagnostics:
$ telnet 192.168.1.102 9090
Trying 192.168.1.102...
telnet: Unable to connect to remote host: Connection timed out
$ ping -c 3 192.168.1.102
PING 192.168.1.102 (192.168.1.102) 56(84) bytes of data.
--- 192.168.1.102 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss
The worker node may also become unreachable to other cluster services, indicating a broader network issue rather than an application-specific problem.
2. Root Cause
The clw-worker-unreachable error occurs when the OpenClaw worker cannot establish or maintain a communication channel with the cluster manager. Understanding the underlying causes is essential for implementing the correct remediation strategy.
Network Connectivity Issues
The most frequent cause is network-level problems that prevent TCP connections between the worker and manager nodes. This includes firewall rules blocking the worker communication port (default: 9090), network segmentation issues in containerized environments, or physical network infrastructure failures.
In Kubernetes deployments, common network issues include incorrect podNetworkCIDR configurations, Calico or Cilium CNI plugin misconfigurations, or NetworkPolicy rules that inadvertently block inter-pod communication.
Service-Level Failures
The worker process itself may have crashed, encountered an out-of-memory condition, or entered a deadlock state. When the worker process terminates unexpectedly, the gRPC connection to the manager drops immediately:
// Worker process exit without graceful shutdown
func (w *Worker) Stop() error {
    // Missing: notify manager of shutdown
    // Missing: drain pending tasks
    os.Exit(1) // Abrupt termination
}
TLS/Authentication Failures
OpenClaw supports mutual TLS authentication between workers and managers. If the worker certificate has expired, the certificate Common Name (CN) doesn’t match the expected value, or the certificate chain is incomplete, the gRPC handshake will fail:
[ERROR] TLS handshake failed: x509: certificate has expired or is not yet valid
[ERROR] TLS handshake failed: x509: certificate relies on legacy Common Name field
Resource Exhaustion
Workers running on resource-constrained nodes may become unreachable due to system-level resource exhaustion. Common scenarios include kernel connection tracking table overflow in high-throughput environments:
$ cat /proc/sys/net/netfilter/nf_conntrack_count
786432
$ cat /proc/sys/net/netfilter/nf_conntrack_max
786432
When the connection tracking table reaches capacity, new connections are silently dropped, causing workers to appear unreachable.
Manager-Side Issues
On rare occasions, the cluster manager itself may be overloaded, misconfigured, or experiencing internal failures that prevent it from accepting new worker connections or responding to existing ones.
3. Step-by-Step Fix
Diagnostic Phase
Before implementing any fixes, gather diagnostic information to confirm the root cause:
# Check worker service status
systemctl status openclaw-worker
# View recent worker logs
journalctl -u openclaw-worker --since "30 minutes ago" | tail -100
# Verify network reachability
nc -zv 192.168.1.10 9090
# Check worker configuration
cat /etc/openclaw/worker.yaml
Fix 1: Restart the Worker Service
If the worker process has crashed or entered an error state, restart the service:
# Stop the worker gracefully
sudo systemctl stop openclaw-worker
# Wait for any in-flight tasks to time out
sleep 10
# Clear stale state files
sudo rm -f /var/lib/openclaw/worker-state/*.lock
sudo rm -f /var/lib/openclaw/worker-state/heartbeat.tmp
# Start the worker
sudo systemctl start openclaw-worker
# Verify startup
sudo systemctl status openclaw-worker
If the worker reconnects but keeps dropping off, also tighten the reconnection settings so transient failures recover faster:
Before:
# /etc/openclaw/worker.yaml
cluster:
  manager_address: "192.168.1.10:9090"
  reconnect_interval: "30s"
  max_reconnect_attempts: 5
After:
# /etc/openclaw/worker.yaml
cluster:
  manager_address: "192.168.1.10:9090"
  reconnect_interval: "5s"
  max_reconnect_attempts: 10
  heartbeat_interval: "10s"
Fix 2: Update Network Configuration
For network-related issues in containerized environments, update the pod network configuration:
# For Kubernetes, edit the worker deployment
kubectl edit deployment openclaw-worker -n openclaw-system
# Add/update environment variables for connection settings
spec:
  template:
    spec:
      containers:
        - name: worker
          env:
            - name: OPENCLAW_MANAGER_HOST
              value: "openclaw-manager.openclaw-system.svc.cluster.local"
            - name: OPENCLAW_MANAGER_PORT
              value: "9090"
            - name: GOMAXPROCS
              value: "4"
Apply network policy exceptions if needed:
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-worker-to-manager
  namespace: openclaw-system
spec:
  podSelector:
    matchLabels:
      component: worker
  policyTypes:
    - Ingress
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              component: manager
      ports:
        - protocol: TCP
          port: 9090
Fix 3: Renew TLS Certificates
If TLS authentication is failing, regenerate the worker certificates:
# Generate new worker certificate request
openssl req -new -newkey rsa:4096 -nodes \
-keyout worker-key.pem \
-out worker.csr \
-subj "/CN=openclaw-worker/O=openclaw-cluster"
# Sign with the cluster CA
openssl x509 -req -in worker.csr \
-CA cluster-ca.crt \
-CAkey cluster-ca-key.pem \
-CAcreateserial \
-out worker.crt \
-days 365 \
-sha256
# Install new certificates
sudo cp worker.crt /etc/openclaw/certs/worker.crt
sudo cp worker-key.pem /etc/openclaw/certs/worker-key.pem
sudo chmod 600 /etc/openclaw/certs/worker-key.pem
# Restart worker
sudo systemctl restart openclaw-worker
Before:
# Expired certificate check
openssl x509 -in /etc/openclaw/certs/worker.crt -noout -dates
notBefore=Jan 15 00:00:00 2023 GMT
notAfter=Jan 15 00:00:00 2024 GMT
After:
# New certificate check
openssl x509 -in /etc/openclaw/certs/worker.crt -noout -dates
notBefore=Jan 20 00:00:00 2024 GMT
notAfter=Jan 20 00:00:00 2025 GMT
Fix 4: Increase System Resources
For connection tracking exhaustion, adjust kernel parameters:
# Temporarily increase conntrack limits
echo 1572864 | sudo tee /proc/sys/net/netfilter/nf_conntrack_max
# Make permanent
sudo bash -c 'cat >> /etc/sysctl.conf <<EOF
# OpenClaw worker connection tracking
net.netfilter.nf_conntrack_max = 1572864
net.netfilter.nf_conntrack_tcp_timeout_established = 7200
EOF'
# Apply changes
sudo sysctl -p
Before:
$ cat /proc/sys/net/netfilter/nf_conntrack_max
786432
After:
$ cat /proc/sys/net/netfilter/nf_conntrack_max
1572864
Fix 5: Container Runtime Adjustments
For Docker-based workers, update container runtime settings:
# Stop existing containers
docker-compose down
# Update docker-compose.yaml
Before:
# docker-compose.yaml
version: '3.8'
services:
  worker:
    image: openclaw/worker:latest
    network_mode: host
    restart: unless-stopped
After:
# docker-compose.yaml
version: '3.8'
services:
  worker:
    image: openclaw/worker:latest
    networks:
      - openclaw-net
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 1G
    sysctls:
      - net.netfilter.nf_conntrack_max=1572864
    healthcheck:
      test: ["CMD", "clw", "health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

networks:
  openclaw-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16
# Restart with new configuration
docker-compose up -d
4. Verification
After implementing fixes, verify that the worker successfully reconnects to the cluster:
Immediate Connectivity Check
# Check worker service status
sudo systemctl status openclaw-worker
# Expected: active (running)
# Verify cluster status
clw cluster status
# Expected: worker shows as "healthy"
Long-Running Verification
Monitor the connection over an extended period to ensure stability:
# Poll cluster status every 10 seconds (Ctrl-C to stop)
watch -n 10 clw cluster status
# Monitor for any disconnections
clw logs --follow --component worker | grep -E "(reconnect|UNREACHABLE|heartbeat)"
Functional Verification
Submit test tasks to verify task processing capability:
# Submit a test task
clw task submit --type ping --payload '{"target": "8.8.8.8"}'
# Monitor task execution
clw task list --worker worker-02 --limit 10
Health Endpoint Verification
For HTTP-based health checks:
# Check worker health endpoint
curl -s http://192.168.1.102:9091/health
# Expected response
{
  "status": "healthy",
  "connected_to_manager": true,
  "last_heartbeat": "2024-01-20T15:30:45Z",
  "active_tasks": 2,
  "version": "2.4.1"
}
5. Common Pitfalls
Ignoring Grace Period After Restart
A frequent mistake is expecting immediate reconnection after restarting the worker service. The manager may apply a grace period before accepting reconnections to prevent flapping. Wait at least 30 seconds before checking connection status:
# Wrong approach
sudo systemctl restart openclaw-worker && clw cluster status
# Correct approach
sudo systemctl restart openclaw-worker
sleep 30
clw cluster status
Fire-and-Forget Configuration Changes
Modifying configuration files without proper validation leads to cascading failures:
# Never edit configuration while worker is running
# Always validate before deployment
# Validate configuration syntax
clw validate-config /etc/openclaw/worker.yaml
# Test in staging first
clw worker --config /etc/openclaw/worker-staging.yaml --dry-run
Incomplete TLS Certificate Updates
Updating only one side of the TLS handshake causes authentication failures:
# Wrong: Only updating worker certificate
sudo cp new-worker.crt /etc/openclaw/certs/worker.crt
sudo systemctl restart openclaw-worker
# Correct: Ensure CA certificate is also current on worker
sudo cp new-worker.crt /etc/openclaw/certs/worker.crt
sudo cp cluster-ca.crt /etc/openclaw/certs/ca.crt
sudo systemctl restart openclaw-worker
Premature Task Cancellation
Manually canceling tasks assigned to a worker before it fully reconnects can cause data loss:
# Wrong approach
clw task cancel --force task-7a8b2c
# Better approach: Wait for reconnection
# If worker reappears, task will continue automatically
# If worker remains unreachable for >15 minutes, then cancel
clw task cancel task-7a8b2c --reason "Worker unreachable for 20 minutes"
Environment Variable vs Config File Conflicts
In Kubernetes deployments, environment variables can override configuration files, causing confusion:
# Check effective configuration
kubectl exec -it openclaw-worker-xyz -- clw debug config
# Verify environment variables
kubectl exec -it openclaw-worker-xyz -- env | grep OPENCLAW
Inadequate Monitoring
Failing to monitor the reconnection pattern can mask recurring issues:
# prometheus-alerts.yaml
groups:
  - name: openclaw-worker
    rules:
      - alert: WorkerUnreachable
        expr: openclaw_worker_status{status="unreachable"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Worker {{ $labels.worker }} is unreachable"
6. Related Errors
clw-connection-refused
This error occurs when the worker cannot establish a TCP connection to the manager. It typically indicates the manager service is not running or the specified port is blocked:
[ERROR] dial tcp 192.168.1.10:9090: connect: connection refused
This is often a precursor to clw-worker-unreachable when the worker repeatedly fails to connect.
clw-handshake-timeout
Occurs during the TLS/credentials handshake phase after a TCP connection is established. The manager or worker does not complete the authentication within the expected timeframe:
[ERROR] Failed to complete TLS handshake: context deadline exceeded
clw-registration-failed
The worker successfully connects but fails to register with the cluster catalog. This can occur when duplicate worker IDs are detected or the cluster is at capacity:
[ERROR] Registration rejected: worker ID 'worker-02' already exists in cluster
clw-heartbeat-lost
The worker was connected but stopped receiving acknowledgment of heartbeats from the manager. This indicates a one-sided network issue or manager overload:
[WARN] No heartbeat acknowledgment received for 60 seconds
[ERROR] Assuming manager connection lost
clw-task-queue-full
The manager’s internal task queue reaches capacity, causing it to reject new worker connections. Workers may attempt to reconnect repeatedly, creating the appearance of an unreachable worker:
[ERROR] Manager rejected connection: task queue at maximum capacity (10000 tasks)
When diagnosing clw-worker-unreachable, examine the related errors in the worker logs to build a complete picture of the failure chain. Proper root cause analysis requires tracing the sequence from initial connection attempt through to the final unreachable state.