Fix clw-scheduler-unreachable: Resolve OpenClaw scheduler connection failure

1. Symptoms

The clw-scheduler-unreachable error manifests in OpenClaw environments when the clw CLI or client libraries fail to establish a connection to the central scheduler service. Common indicators include:

CLI commands like clw submit job.yaml or clw status output:

Error: clw-scheduler-unreachable: Failed to connect to scheduler at clw-scheduler:8080 (dial tcp 127.0.0.1:8080: connect: connection refused)

Application logs show repeated connection attempts:

2024-10-05T12:34:56Z [ERROR] claw.client.SchedulerClient: Connection timeout to scheduler endpoint 'http://clw-scheduler.default.svc.cluster.local:8080/api/v1/health'

Jobs remain in PENDING state indefinitely without progression to RUNNING.
Dashboard (if accessible) reports “Scheduler offline” or shows zero active workers.

Kubernetes/Docker deployments exhibit pod restarts for scheduler components with events like:

Warning  Unhealthy  10s  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503

This error disrupts job submission, monitoring, and scaling in OpenClaw clusters, often halting CI/CD pipelines or batch workloads. Network traces (e.g., via tcpdump) show the client's SYN packets either answered with RST (connection refused) or left without a SYN-ACK from the scheduler host (timeout).
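
The sample CLI error above names one endpoint (clw-scheduler:8080) but dials another (127.0.0.1:8080), which suggests the hostname resolved to loopback. The sketch below pulls both values out of such a line and flags a loopback dial address. It assumes the message has the same shape as the sample; parse_unreachable is a hypothetical helper, not part of clw, and it does not handle IPv6 addresses.

```shell
#!/usr/bin/env sh
# Hypothetical helper: assumes the error line looks like the sample above,
# "... scheduler at HOST:PORT (dial tcp IP:PORT: ...".
parse_unreachable() {
  local line="$1" configured dialed loopback=no
  # Endpoint the client was configured to use.
  configured=$(printf '%s\n' "$line" | sed -n 's/.*scheduler at \([^ ]*\) .*/\1/p')
  # Address the client actually dialed after name resolution.
  dialed=$(printf '%s\n' "$line" | sed -n 's/.*dial tcp \([^:]*:[0-9]*\):.*/\1/p')
  # A loopback dial address means the client is talking to its own host.
  case "$dialed" in
    127.*|localhost:*) loopback=yes ;;
  esac
  printf 'configured=%s dialed=%s loopback=%s\n' "$configured" "$dialed" "$loopback"
}

parse_unreachable 'Error: clw-scheduler-unreachable: Failed to connect to scheduler at clw-scheduler:8080 (dial tcp 127.0.0.1:8080: connect: connection refused)'
```

A loopback=yes result points at client-side configuration or name resolution (for example an /etc/hosts entry) rather than at the scheduler itself.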

2. Root Cause

OpenClaw’s clw tool relies on a gRPC/HTTP-based scheduler service for job orchestration. The clw-scheduler-unreachable error stems from:

Service Downtime: Scheduler pod/container crashed or stopped. Commonly caused by OOM kills, misconfigurations, or unhandled panics in OpenClaw v2.3+.
Network Connectivity Issues:
- DNS resolution failure for clw-scheduler hostname.
- Firewall rules blocking ports 8080 (HTTP health), 9090 (gRPC), or 8443 (TLS).
- Service mesh (Istio/Linkerd) misrouting or mTLS enforcement without certs.
Configuration Mismatch:
- CLW_SCHEDULER_URL env var points to wrong endpoint (e.g., http://localhost:8080 in cluster mode).
- Kubernetes Service selector mismatch, causing clw-scheduler to resolve to wrong pods.
Resource Exhaustion: Scheduler host out of CPU/memory, leading to refused connections.
Version Skew: Client clw v2.4.1 incompatible with scheduler v2.3.0 due to API changes.
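
The version-skew cause can be checked mechanically once both version strings are known. The sketch below assumes that client and scheduler are compatible only when their major and minor numbers match, and that versions have three parts such as v2.4.1. Both are assumptions; the real OpenClaw compatibility policy may differ. same_minor is a hypothetical helper, and the versions are passed in by hand.

```shell
#!/usr/bin/env sh
# Assumption: compatibility requires identical major.minor (e.g. 2.4.x with 2.4.y).
same_minor() {
  local a="${1#v}" b="${2#v}"   # strip an optional leading "v"
  # Drop the patch component and compare what is left (e.g. "2.4").
  [ "${a%.*}" = "${b%.*}" ]
}

if same_minor v2.4.1 v2.3.0; then
  echo "compatible"
else
  echo "version skew"
fi
```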

Diagnostics via kubectl logs deployment/clw-scheduler -n claw-system often reveal:


panic: failed to bind to 0.0.0.0:8080: address already in use

level=error msg="gRPC server shutdown due to max connections exceeded"
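
The two log lines above map to different root causes, so a quick tally of scheduler log lines can narrow the search. A minimal sketch, assuming the log messages contain exactly the phrases shown above; classify_log_line and its category names are hypothetical.

```shell
#!/usr/bin/env sh
# Map a scheduler log line to a coarse root-cause label (labels are illustrative).
classify_log_line() {
  case "$1" in
    *"address already in use"*)   echo "port-conflict" ;;
    *"max connections exceeded"*) echo "connection-limit" ;;
    *)                            echo "unknown" ;;
  esac
}

classify_log_line 'panic: failed to bind to 0.0.0.0:8080: address already in use'
classify_log_line 'level=error msg="gRPC server shutdown due to max connections exceeded"'
```

To tally a live log, each line of the kubectl logs output can be piped through the function in a while read loop and counted with sort | uniq -c.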

3. Step-by-Step Fix

Follow these steps to resolve clw-scheduler-unreachable. Test in a staging cluster first.

Step 1: Verify Scheduler Service Status

Run health checks.

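# Check if service endpoint resolves and responds (adjust the namespace if yours differs)
nslookup clw-scheduler.claw-system.svc.cluster.local
# -f makes curl exit non-zero on HTTP errors such as 503; --max-time bounds the wait
curl -fv --max-time 5 http://clw-scheduler.claw-system.svc.cluster.local:8080/health || echo "Unreachable"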

# In Docker Compose/single-node
docker ps | grep clw-scheduler
docker logs clw-scheduler-container
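
A single failed probe can be a transient blip during a rollout, so it may help to retry the health check a few times before concluding the scheduler is down. The following is a generic retry sketch; retry is a hypothetical helper, and the attempt count and delay are arbitrary.

```shell
#!/usr/bin/env sh
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Usage, with the health URL from this step: retry 5 2 curl -fsS --max-time 3 http://clw-scheduler.claw-system.svc.cluster.local:8080/health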

Step 2: Restart Scheduler Service

Before:

# Faulty Deployment (missing resources, wrong selector)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clw-scheduler
spec:
  selector:
    matchLabels:
      app: claw-worker  # Wrong label!
  template:
    spec:
      containers:
      - name: scheduler
        image: openclaw/scheduler:v2.3.0
        ports:
        - containerPort: 8080
        resources: {}  # No limits, OOM risk

After:

# Fixed Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clw-scheduler
  namespace: claw-system
spec:
  selector:
    matchLabels:
      app: clw-scheduler  # Correct label
  template:
    metadata:
      labels:
        app: clw-scheduler
    spec:
      containers:
      - name: scheduler
        image: openclaw/scheduler:v2.4.1  # Updated version
        ports:
        - containerPort: 8080
        - containerPort: 9090
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 1
            memory: 2Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        env:
        - name: CLAW_SCHEDULER_BIND_ADDR
          value: "0.0.0.0:8080"

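Apply: kubectl apply -f scheduler-deployment.yaml && kubectl rollout status deployment/clw-scheduler -n claw-system

Note: spec.selector is immutable on an existing apps/v1 Deployment. If the selector changed, delete the old Deployment in its original namespace (kubectl delete deployment clw-scheduler) before applying the fixed manifest.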
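
The faulty manifest above fails because its selector (app: claw-worker) does not match the scheduler pods. The sketch below checks whether every key=value pair of a selector appears in a pod's label set. It works on comma-separated strings in the form kubectl prints them; selector_matches is a hypothetical helper.

```shell
#!/usr/bin/env sh
# selector_matches <selector> <labels>
# Both arguments are comma-separated key=value lists, e.g. "app=clw-scheduler".
# Returns 0 only if every selector pair is present in the labels.
selector_matches() {
  local selector="$1" labels="$2" pair
  [ -n "$selector" ] || return 1
  local IFS=,
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;          # pair found, keep checking
      *) return 1 ;;           # pair missing, selector does not match
    esac
  done
  return 0
}

selector_matches 'app=claw-worker' 'app=clw-scheduler,pod-template-hash=abc' \
  || echo "selector does not match scheduler pods"
```

The inputs can be read from the SELECTOR column of kubectl get svc clw-scheduler -n claw-system -o wide and the LABELS column of kubectl get pods -n claw-system --show-labels.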

Step 3: Fix Client Configuration

Before:

# ~/.claw/config.yaml (incorrect URL)
scheduler:
  url: "http://localhost:8080"  # Wrong for cluster
auth:
  token: "invalid-token"

After:

# ~/.claw/config.yaml (cluster-correct)
scheduler:
  url: "http://clw-scheduler.claw-system.svc.cluster.local:8080"
  grpc_port: 9090
auth:
  # Plain YAML does not run $(...); paste the decoded token here. To print it:
  #   kubectl get secret claw-auth -o jsonpath='{.data.token}' | base64 -d
  token: "<decoded-token>"
timeout: 30s

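Set env: export CLW_SCHEDULER_URL=http://clw-scheduler.claw-system.svc.cluster.local:8080 (the short name clw-scheduler resolves only from pods in the same namespace).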
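
To avoid typing the in-cluster DNS name by hand, the URL can be assembled from its parts. A minimal sketch, assuming the default cluster domain cluster.local and the service name, namespace, and port used in this guide; scheduler_url is a hypothetical helper.

```shell
#!/usr/bin/env sh
# scheduler_url [service] [namespace] [port]
# Builds the fully qualified in-cluster URL; assumes the cluster domain is cluster.local.
scheduler_url() {
  local service="${1:-clw-scheduler}" namespace="${2:-claw-system}" port="${3:-8080}"
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

CLW_SCHEDULER_URL="$(scheduler_url)"
export CLW_SCHEDULER_URL
echo "$CLW_SCHEDULER_URL"
```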

Step 4: Check Network and Firewall

# Test connectivity from client pod
kubectl exec -it clw-client -- nc -zv clw-scheduler 8080

# Open firewall (ufw example)
sudo ufw allow 8080/tcp
sudo ufw allow 9090/tcp

# For Docker: ensure --network claw-net
docker network inspect claw-net
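
Where nc is not installed in the client image, bash's built-in /dev/tcp redirection can stand in for it. The sketch below also separates a timeout (typical of a firewall silently dropping packets) from an immediate failure (refused connection or unresolved name). It assumes bash and the coreutils timeout command are available; probe_port is a hypothetical helper, and the 3-second limit is arbitrary.

```shell
#!/usr/bin/env sh
# probe_port <host> <port>
# Prints "<host>:<port> open", "... timeout", or "... refused-or-unresolved".
probe_port() {
  local host="$1" port="$2" rc=0
  # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null || rc=$?
  case "$rc" in
    0)   echo "$host:$port open" ;;
    124) echo "$host:$port timeout" ;;               # timeout(1) exits 124 when the limit is hit
    *)   echo "$host:$port refused-or-unresolved" ;;
  esac
}
```

Usage: for p in 8080 9090 8443; do probe_port clw-scheduler "$p"; done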

Step 5: Scale and Monitor

kubectl scale deployment clw-scheduler --replicas=3 -n claw-system
kubectl autoscale deployment clw-scheduler --cpu-percent=70 --min=2 --max=5 -n claw-system

4. Verification

Post-fix validation:

CLI test:

clw health  # Should return: {"status": "healthy"}
clw submit --dry-run job.yaml  # No unreachable error

Logs check:

kubectl logs -l app=clw-scheduler -n claw-system --tail=50 | grep -i "server listening"
# Expect: "gRPC server listening on 0.0.0.0:9090"

End-to-end:

clw submit job.yaml --wait
clw status  # Job in RUNNING or COMPLETED

Metrics: Query the scheduler's /metrics endpoint and confirm that connection counts are greater than 0.
Load test: Submit 10 jobs; monitor kubectl top pods -l app=clw-scheduler.

If issues persist, enable debug: clw --log-level debug status.
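
For scripted verification (for example a CI gate), the JSON printed by clw health can be checked without extra tooling. A minimal sketch, assuming the output has the shape shown in the CLI test above; is_healthy is a hypothetical helper, and a JSON parser such as jq would be more robust where available.

```shell
#!/usr/bin/env sh
# Reads health JSON on stdin; succeeds only if "status" is exactly "healthy".
is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

if echo '{"status": "healthy"}' | is_healthy; then
  echo "scheduler healthy"
else
  echo "scheduler NOT healthy"
fi
```

Usage: clw health | is_healthy || exit 1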

5. Common Pitfalls

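Namespace Mismatch: Clients in default cannot resolve the short name clw-scheduler for a Service in claw-system. Fix: Use the fully qualified name clw-scheduler.claw-system.svc.cluster.local; for local debugging, kubectl port-forward svc/clw-scheduler 8080:8080 -n claw-system also works.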
TLS Skip: Prod requires HTTPS; clients fail without --insecure. ⚠️ Unverified in v2.4.2.
Init Container Hang: Scheduler waits for DB (PostgreSQL); add initContainers dependency check.
Multi-Cluster: Wrong CLW_CLUSTER_ID leads to auth loops post-connection.
Resource Limits Too Tight: 256Mi memory causes panics; bump to 1Gi min.
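Version Pinning: Docker Compose overrides can silently change the image tag; pin an explicit tag that matches the client (e.g., image: openclaw/scheduler:v2.4.1) rather than latest, which reintroduces version skew.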
Service Type: ClusterIP vs LoadBalancer; external access needs NodePort/Ingress.

Avoid running kubectl delete pod on scheduler pods without a PodDisruptionBudget (PDB) in place; doing so can take the scheduler offline.
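
A PDB for the scheduler can be created imperatively. The sketch below wraps the command in a dry-run helper so it only prints what it would run unless DRY_RUN=0 is set. The PDB name clw-scheduler-pdb and the run helper are hypothetical; the selector and namespace follow the fixed Deployment above. Note that a PDB limits voluntary evictions such as node drains, and --min-available=1 only makes sense with more than one replica.

```shell
#!/usr/bin/env sh
# run <command...>: print the command in dry-run mode (default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Keep at least one scheduler pod available during voluntary disruptions.
run kubectl create poddisruptionbudget clw-scheduler-pdb \
  --namespace=claw-system \
  --selector=app=clw-scheduler \
  --min-available=1
```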

Error Code	Description	Key Difference
clw-auth-failed	Token invalid post-connection	Hits scheduler but 401s
clw-job-queue-full	Scheduler reachable, queue backed up	429 status
scheduler-overloaded	High load, slow responses	Timeouts, not refused

Cross-reference clw-auth-failed guide for token workflows.
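
The distinctions in the table above can be written as a small lookup for triage scripts. This sketch only mirrors the table: the input is either an HTTP status code or one of the words refused or timeout, and classify_failure is a hypothetical helper rather than anything clw provides.

```shell
#!/usr/bin/env sh
# classify_failure <observation>
# observation: an HTTP status code, or "refused" / "timeout" for transport failures.
classify_failure() {
  case "$1" in
    refused) echo "clw-scheduler-unreachable" ;;  # nothing is listening or routing is broken
    401)     echo "clw-auth-failed" ;;            # scheduler reached, token rejected
    429)     echo "clw-job-queue-full" ;;         # scheduler reached, queue backed up
    timeout) echo "scheduler-overloaded" ;;       # slow responses rather than refusals
    *)       echo "unknown" ;;
  esac
}

classify_failure refused
classify_failure 429
```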
