Fix clw-worker-unreachable: Worker Node Cannot Connect to Cluster

Category: Distributed Computing | Difficulty: Intermediate | Platforms: Linux, macOS, Kubernetes, Docker

1. Symptoms

When a worker node becomes unreachable in an OpenClaw cluster, you will observe several distinct symptoms that indicate the communication breakdown between the worker and the cluster manager.

Primary Indicators

The most common manifestation is the worker disappearing from the cluster dashboard or appearing with a status of UNREACHABLE. When you query the cluster with the clw cluster status command, the affected worker shows as unavailable:

$ clw cluster status
CLUSTER: production-cluster
├── Manager:        healthy (leader)
├── Workers:        3 total, 1 unreachable
│   ├── worker-01   healthy    192.168.1.101:9090
│   ├── worker-02   UNREACHABLE 192.168.1.102:9090
│   └── worker-03   healthy    192.168.1.103:9090
└── Queue Depth:    247 pending tasks

Log Error Messages

The worker logs will typically contain one or more of the following error messages:

[ERROR] [2024-01-15T10:23:45.123Z] Worker connection lost: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.10:9090: connect: connection refused"
[ERROR] [2024-01-15T10:23:45.456Z] Failed to send heartbeat to manager after 3 attempts
[WARN]  [2024-01-15T10:23:46.789Z] Retrying connection in 5 seconds...

Task Processing Impact

Tasks that were assigned to the unreachable worker will remain in a PENDING or ASSIGNED state indefinitely. The cluster manager receives no acknowledgments for them, which can lead to task duplication once the task timeout is exceeded and the tasks are re-dispatched:

$ clw task list --status PENDING
TASK ID        TYPE           WORKER      AGE     RETRY
task-7a8b2c    data-process   worker-02   15m     0/3
task-8c9d3e    image-resize   worker-02   15m     0/3
task-9f0e4a    video-encode   worker-02   15m     0/3
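The duplication risk described above can be contained by making task handling idempotent on the worker side. The sketch below illustrates the idea with a seen-ID set; the Deduper type and its methods are hypothetical, not part of the OpenClaw API:

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper records task IDs that have already been handled, so a task
// re-dispatched after a worker timeout is not processed a second time.
// Illustrative sketch, not the OpenClaw implementation.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]bool)}
}

// Process returns false when the task ID was already handled.
func (d *Deduper) Process(taskID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[taskID] {
		return false
	}
	d.seen[taskID] = true
	return true
}

func main() {
	d := NewDeduper()
	fmt.Println(d.Process("task-7a8b2c")) // first delivery
	fmt.Println(d.Process("task-7a8b2c")) // redelivery after timeout
}
```

In a real deployment the seen set would need persistence or an expiry policy, but the principle is the same: deduplicate by task ID, not by delivery.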

Network-Level Symptoms

From a network perspective, you may observe the following when performing diagnostics:

$ telnet 192.168.1.102 9090
Trying 192.168.1.102...
telnet: Unable to connect to remote host: Connection timed out

$ ping -c 3 192.168.1.102
PING 192.168.1.102 (192.168.1.102) 56(84) bytes of data.
--- 192.168.1.102 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss

The worker node may also become unreachable to other cluster services, indicating a broader network issue rather than an application-specific problem.

2. Root Cause

The clw-worker-unreachable error occurs when the OpenClaw worker cannot establish or maintain a communication channel with the cluster manager. Understanding the underlying causes is essential for implementing the correct remediation strategy.

Network Connectivity Issues

The most frequent cause is network-level problems that prevent TCP connections between the worker and manager nodes. This includes firewall rules blocking the worker communication port (default: 9090), network segmentation issues in containerized environments, or physical network infrastructure failures.

In Kubernetes deployments, common network issues include incorrect podNetworkCIDR configurations, Calico or Cilium CNI plugin misconfigurations, or NetworkPolicy rules that inadvertently block inter-pod communication.

Service-Level Failures

The worker process itself may have crashed, encountered an out-of-memory condition, or entered a deadlock state. When the worker process terminates unexpectedly, the gRPC connection to the manager drops immediately:

// Worker process exit without graceful shutdown
func (w *Worker) Stop() error {
    // Missing: notify manager of shutdown
    // Missing: drain pending tasks
    os.Exit(1)  // Abrupt termination
}

TLS/Authentication Failures

OpenClaw supports mutual TLS authentication between workers and managers. If the worker certificate has expired, the certificate Common Name (CN) doesn’t match the expected value, or the certificate chain is incomplete, the gRPC handshake will fail:

[ERROR] TLS handshake failed: x509: certificate has expired or is not yet valid
[ERROR] TLS handshake failed: x509: certificate relies on legacy Common Name field

Resource Exhaustion

Workers running on resource-constrained nodes may become unreachable due to system-level resource exhaustion. Common scenarios include kernel connection tracking table overflow in high-throughput environments:

$ cat /proc/sys/net/netfilter/nf_conntrack_count
786432
$ cat /proc/sys/net/netfilter/nf_conntrack_max
786432

When the connection tracking table reaches capacity, new connections are silently dropped, causing workers to appear unreachable.
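The count/max comparison above is easy to automate. This sketch computes table utilization from the two values; in production the inputs would be read from /proc/sys/net/netfilter/nf_conntrack_count and nf_conntrack_max, and the 90% alert threshold is an illustrative choice:

```go
package main

import "fmt"

// conntrackPressure reports connection tracking table utilization and
// whether it has crossed an alerting threshold. The 90% cutoff is an
// illustrative choice, not an OpenClaw default.
func conntrackPressure(count, max int) (float64, bool) {
	util := float64(count) / float64(max)
	return util, util >= 0.9
}

func main() {
	// Values matching the /proc output shown above: table completely full.
	util, critical := conntrackPressure(786432, 786432)
	fmt.Printf("conntrack utilization: %.0f%%, critical=%v\n", util*100, critical)
}
```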

Manager-Side Issues

On rare occasions, the cluster manager itself may be overloaded, misconfigured, or experiencing internal failures that prevent it from accepting new worker connections or responding to existing ones.

3. Step-by-Step Fix

Diagnostic Phase

Before implementing any fixes, gather diagnostic information to confirm the root cause:

# Check worker service status
systemctl status openclaw-worker

# View recent worker logs
journalctl -u openclaw-worker --since "30 minutes ago" | tail -100

# Verify network reachability
nc -zv 192.168.1.10 9090

# Check worker configuration
cat /etc/openclaw/worker.yaml

Fix 1: Restart the Worker Service

If the worker process has crashed or entered an error state, restart the service:

# Stop the worker gracefully
sudo systemctl stop openclaw-worker

# Wait for any pending tasks to timeout
sleep 10

# Clear stale state files
sudo rm -f /var/lib/openclaw/worker-state/*.lock
sudo rm -f /var/lib/openclaw/worker-state/heartbeat.tmp

# Start the worker
sudo systemctl start openclaw-worker

# Verify startup
sudo systemctl status openclaw-worker

If restarts alone don't keep the worker connected, tighten the reconnect and heartbeat settings so the worker retries sooner and more persistently:

Before:

# /etc/openclaw/worker.yaml
cluster:
  manager_address: "192.168.1.10:9090"
  reconnect_interval: "30s"
  max_reconnect_attempts: 5

After:

# /etc/openclaw/worker.yaml
cluster:
  manager_address: "192.168.1.10:9090"
  reconnect_interval: "5s"
  max_reconnect_attempts: 10
  heartbeat_interval: "10s"

Fix 2: Update Network Configuration

For network-related issues in containerized environments, update the pod network configuration:

# For Kubernetes, edit the worker deployment
kubectl edit deployment openclaw-worker -n openclaw-system

# Add/update environment variables for connection settings
spec:
  template:
    spec:
      containers:
      - name: worker
        env:
        - name: OPENCLAW_MANAGER_HOST
          value: "openclaw-manager.openclaw-system.svc.cluster.local"
        - name: OPENCLAW_MANAGER_PORT
          value: "9090"
        - name: GOMAXPROCS
          value: "4"

Apply network policy exceptions if needed:

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-worker-to-manager
  namespace: openclaw-system
spec:
  podSelector:
    matchLabels:
      component: worker
  policyTypes:
  - Ingress
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          component: manager
    ports:
    - protocol: TCP
      port: 9090

Fix 3: Renew TLS Certificates

If TLS authentication is failing, regenerate the worker certificates:

# Generate new worker certificate request
openssl req -new -newkey rsa:4096 -nodes \
  -keyout worker-key.pem \
  -out worker.csr \
  -subj "/CN=openclaw-worker/O=openclaw-cluster"

# Sign with the cluster CA, adding a SAN so modern TLS stacks do not
# fall back to the legacy Common Name field
openssl x509 -req -in worker.csr \
  -CA cluster-ca.crt \
  -CAkey cluster-ca-key.pem \
  -CAcreateserial \
  -out worker.crt \
  -days 365 \
  -sha256 \
  -extfile <(printf "subjectAltName=DNS:openclaw-worker")

# Install new certificates
sudo cp worker.crt /etc/openclaw/certs/worker.crt
sudo cp worker-key.pem /etc/openclaw/certs/worker-key.pem
sudo chmod 600 /etc/openclaw/certs/worker-key.pem

# Restart worker
sudo systemctl restart openclaw-worker

Before:

# Expired certificate check
openssl x509 -in /etc/openclaw/certs/worker.crt -noout -dates
notBefore=Jan 15 2023
notAfter=Jan 15 2024

After:

# New certificate check
openssl x509 -in /etc/openclaw/certs/worker.crt -noout -dates
notBefore=Jan 20 2024
notAfter=Jan 20 2025

Fix 4: Increase System Resources

For connection tracking exhaustion, adjust kernel parameters:

# Temporarily increase conntrack limits
echo 1572864 | sudo tee /proc/sys/net/netfilter/nf_conntrack_max

# Make permanent
sudo bash -c 'cat >> /etc/sysctl.conf <<EOF
# OpenClaw worker connection tracking
net.netfilter.nf_conntrack_max = 1572864
net.netfilter.nf_conntrack_tcp_timeout_established = 7200
EOF'

# Apply changes
sudo sysctl -p

Before:

$ cat /proc/sys/net/netfilter/nf_conntrack_max
786432

After:

$ cat /proc/sys/net/netfilter/nf_conntrack_max
1572864

Fix 5: Container Runtime Adjustments

For Docker-based workers, update container runtime settings:

# Stop existing containers
docker-compose down

# Update docker-compose.yaml

Before:

# docker-compose.yaml
version: '3.8'
services:
  worker:
    image: openclaw/worker:latest
    network_mode: host
    restart: unless-stopped

After:

# docker-compose.yaml
version: '3.8'
services:
  worker:
    image: openclaw/worker:latest
    networks:
      - openclaw-net
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 1G
    sysctls:
      - net.netfilter.nf_conntrack_max=1572864
    healthcheck:
      test: ["CMD", "clw", "health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

networks:
  openclaw-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16

# Restart with new configuration
docker-compose up -d

4. Verification

After implementing fixes, verify that the worker successfully reconnects to the cluster:

Immediate Connectivity Check

# Check worker service status
sudo systemctl status openclaw-worker
# Expected: active (running)

# Verify cluster status
clw cluster status
# Expected: worker shows as "healthy"

Long-Running Verification

Monitor the connection over an extended period to ensure stability:

# Watch cluster status for 5 minutes
watch -n 10 clw cluster status

# Monitor for any disconnections
clw logs --follow --component worker | grep -E "(reconnect|UNREACHABLE|heartbeat)"

Functional Verification

Submit test tasks to verify task processing capability:

# Submit a test task
clw task submit --type ping --payload '{"target": "8.8.8.8"}'

# Monitor task execution
clw task list --worker worker-02 --limit 10

Health Endpoint Verification

For HTTP-based health checks:

# Check worker health endpoint
curl -s http://192.168.1.102:9091/health

# Expected response
{
  "status": "healthy",
  "connected_to_manager": true,
  "last_heartbeat": "2024-01-20T15:30:45Z",
  "active_tasks": 2,
  "version": "2.4.1"
}

5. Common Pitfalls

Ignoring Grace Period After Restart

A frequent mistake is expecting immediate reconnection after restarting the worker service. The manager may apply a grace period before accepting reconnections to prevent flapping. Wait at least 30 seconds before checking connection status:

# Wrong approach
sudo systemctl restart openclaw-worker && clw cluster status

# Correct approach
sudo systemctl restart openclaw-worker
sleep 30
clw cluster status

Fire-and-Forget Configuration Changes

Modifying configuration files without proper validation leads to cascading failures:

# Never edit configuration while worker is running
# Always validate before deployment

# Validate configuration syntax
clw validate-config /etc/openclaw/worker.yaml

# Test in staging first
clw worker --config /etc/openclaw/worker-staging.yaml --dry-run

Incomplete TLS Certificate Updates

Updating only one side of the TLS handshake causes authentication failures:

# Wrong: only updating the worker certificate
sudo cp new-worker.crt /etc/openclaw/certs/worker.crt
sudo systemctl restart openclaw-worker

# Correct: ensure the CA certificate is also current on the worker
sudo cp new-worker.crt /etc/openclaw/certs/worker.crt
sudo cp cluster-ca.crt /etc/openclaw/certs/ca.crt
sudo systemctl restart openclaw-worker

Premature Task Cancellation

Manually canceling tasks assigned to a worker before it fully reconnects can cause data loss:

# Wrong approach
clw task cancel --force task-7a8b2c

# Better approach: Wait for reconnection
# If worker reappears, task will continue automatically
# If worker remains unreachable for >15 minutes, then cancel
clw task cancel --task-id task-7a8b2c --reason "Worker unreachable for 20 minutes"

Environment Variable vs Config File Conflicts

In Kubernetes deployments, environment variables can override configuration files, causing confusion:

# Check effective configuration
kubectl exec -it openclaw-worker-xyz -- clw debug config

# Verify environment variables
kubectl exec -it openclaw-worker-xyz -- env | grep OPENCLAW

Inadequate Monitoring

Failing to monitor the reconnection pattern can mask recurring issues:

# prometheus-alerts.yaml
groups:
- name: openclaw-worker
  rules:
  - alert: WorkerUnreachable
    expr: openclaw_worker_status{status="unreachable"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Worker {{ $labels.worker }} is unreachable"

6. Related Errors

clw-connection-refused

This error occurs when the worker cannot establish a TCP connection to the manager. It typically indicates the manager service is not running or the specified port is blocked:

[ERROR] dial tcp 192.168.1.10:9090: connect: connection refused

This is often a precursor to clw-worker-unreachable when the worker repeatedly fails to connect.

clw-handshake-timeout

Occurs during the TLS/credentials handshake phase after a TCP connection is established. The manager or worker does not complete the authentication within the expected timeframe:

[ERROR] Failed to complete TLS handshake: context deadline exceeded

clw-registration-failed

The worker successfully connects but fails to register with the cluster catalog. This can occur when duplicate worker IDs are detected or the cluster is at capacity:

[ERROR] Registration rejected: worker ID 'worker-02' already exists in cluster

clw-heartbeat-lost

The worker was connected but stopped receiving acknowledgment of heartbeats from the manager. This indicates a one-sided network issue or manager overload:

[WARN] No heartbeat acknowledgment received for 60 seconds
[ERROR] Assuming manager connection lost

clw-task-queue-full

The manager’s internal task queue reaches capacity, causing it to reject new worker connections. Workers may attempt to reconnect repeatedly, creating the appearance of an unreachable worker:

[ERROR] Manager rejected connection: task queue at maximum capacity (10000 tasks)

When diagnosing clw-worker-unreachable, examine the related errors in the worker logs to build a complete picture of the failure chain. Proper root cause analysis requires tracing the sequence from initial connection attempt through to the final unreachable state.