1. Symptoms
The clw-scheduler-unreachable error manifests in OpenClaw environments when the clw CLI or client libraries fail to establish a connection to the central scheduler service. Common indicators include:
- CLI commands like clw submit job.yaml or clw status output:
  Error: clw-scheduler-unreachable: Failed to connect to scheduler at clw-scheduler:8080 (dial tcp 127.0.0.1:8080: connect: connection refused)
- Application logs show repeated connection attempts:
  2024-10-05T12:34:56Z [ERROR] claw.client.SchedulerClient: Connection timeout to scheduler endpoint 'http://clw-scheduler.default.svc.cluster.local:8080/api/v1/health'
- Jobs remain in PENDING state indefinitely without progression to RUNNING.
- Dashboard (if accessible) reports “Scheduler offline” or shows zero active workers.
- Kubernetes/Docker deployments exhibit pod restarts for scheduler components with events like:
  Warning Unhealthy 10s kubelet Liveness probe failed: HTTP probe failed with statuscode: 503
This error disrupts job submission, monitoring, and scaling in OpenClaw clusters, often halting CI/CD pipelines or batch workloads. Network traces (e.g., via tcpdump) typically show SYN packets that receive either no SYN-ACK from the scheduler host (filtered or dropped traffic) or an immediate RST (connection refused).
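The sample CLI error carries a useful hint: the configured endpoint is clw-scheduler:8080, but the address actually dialed is 127.0.0.1, so the hostname resolved to loopback. A minimal sketch of extracting that from the message follows. The regular expression is modeled only on the sample message above (the real format may differ), it handles IPv4 dial addresses only, and `parse_unreachable` is a hypothetical helper, not part of clw.

```python
import ipaddress
import re

# Pattern modeled on the sample error message; an assumption, not a documented format.
ERROR_RE = re.compile(
    r"Failed to connect to scheduler at (?P<endpoint>[^\s:]+):(?P<port>\d+) "
    r"\(dial tcp (?P<ip>[^\s:]+):(?P<dial_port>\d+): (?P<reason>.+)\)"
)


def parse_unreachable(message: str):
    """Return the configured endpoint, dialed address and reason, or None."""
    match = ERROR_RE.search(message)
    if match is None:
        return None
    try:
        loopback = ipaddress.ip_address(match["ip"]).is_loopback
    except ValueError:
        loopback = None  # dialed address was not a literal IP
    return {
        "endpoint": f'{match["endpoint"]}:{match["port"]}',
        "dialed": f'{match["ip"]}:{match["dial_port"]}',
        "reason": match["reason"],
        "loopback": loopback,
    }
```

If `loopback` is true while the client runs in a different pod or host than the scheduler, the name resolution or the configured URL is the first thing to check.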
2. Root Cause
OpenClaw’s clw tool relies on a gRPC/HTTP-based scheduler service for job orchestration. The clw-scheduler-unreachable error stems from:
- Service Downtime: Scheduler pod/container crashed or stopped. Common due to OOM kills, misconfigurations, or unhandled panics in Claw v2.3+.
- Network Connectivity Issues:
  - DNS resolution failure for clw-scheduler hostname.
  - Firewall rules blocking ports 8080 (HTTP health), 9090 (gRPC), or 8443 (TLS).
  - Service mesh (Istio/Linkerd) misrouting or mTLS enforcement without certs.
- Configuration Mismatch:
  - CLW_SCHEDULER_URL env var points to wrong endpoint (e.g., http://localhost:8080 in cluster mode).
  - Kubernetes Service selector mismatch, causing clw-scheduler to resolve to wrong pods.
- Resource Exhaustion: Scheduler host out of CPU/memory, leading to refused connections.
- Version Skew: Client clw v2.4.1 incompatible with scheduler v2.3.0 due to API changes.
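Version skew is easy to check mechanically once both version strings are known. The sketch below assumes, purely for illustration, that client and scheduler are compatible when they share the same major.minor version; the source only states that v2.4.1 and v2.3.0 are incompatible, so the actual compatibility rule may be different. `parse_version` and `has_skew` are hypothetical helpers.

```python
import re


def parse_version(tag: str) -> tuple:
    """Turn 'v2.4.1' or '2.4.1' into (2, 4, 1)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {tag!r}")
    return tuple(int(part) for part in match.groups())


def has_skew(client: str, scheduler: str) -> bool:
    """Assumed rule: a differing major or minor component counts as skew."""
    return parse_version(client)[:2] != parse_version(scheduler)[:2]
```

With the versions named above, `has_skew("v2.4.1", "v2.3.0")` reports skew, while two 2.4.x builds do not.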
Diagnostics via kubectl logs deployment/clw-scheduler -n claw-system often reveal:
panic: failed to bind to 0.0.0.0:8080: address already in use
or
level=error msg="gRPC server shutdown due to max connections exceeded"
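When scanning a long log, matching these two signatures automatically can save time. This is a small sketch that maps the substrings from the sample lines above to the root causes listed in this section; the signature table is an assumption built from those two samples only, and `classify_log_line` is a hypothetical helper.

```python
# Substrings taken from the two sample scheduler log lines; extend as needed.
KNOWN_SIGNATURES = {
    "address already in use": "port conflict: another process holds the bind address",
    "max connections exceeded": "resource exhaustion: gRPC connection limit reached",
}


def classify_log_line(line: str):
    """Return a probable cause for a scheduler log line, or None if unknown."""
    lowered = line.lower()
    for needle, cause in KNOWN_SIGNATURES.items():
        if needle in lowered:
            return cause
    return None
```

Piping the kubectl logs output through a loop that calls `classify_log_line` on each line gives a quick first-pass triage.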
3. Step-by-Step Fix
Follow these steps to resolve clw-scheduler-unreachable. Test in a staging cluster first.
Step 1: Verify Scheduler Service Status
Run health checks.
# Check if service endpoint resolves and responds
nslookup clw-scheduler.default.svc.cluster.local
curl -v http://clw-scheduler.default.svc.cluster.local:8080/health || echo "Unreachable"
# In Docker Compose/single-node
docker ps | grep clw-scheduler
docker logs clw-scheduler-container
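The same health check can be scripted so that it distinguishes a scheduler that answers with an error (for example the 503 seen in the liveness probe) from one that cannot be reached at all. The sketch below uses only the Python standard library; the /health path is taken from the curl command above, and `health_status` is a hypothetical helper name.

```python
import urllib.error
import urllib.request


def health_status(base_url: str, path: str = "/health", timeout: float = 5.0) -> str:
    """Return 'healthy', 'unhealthy' (HTTP error reply) or 'unreachable'."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return "healthy" if response.status == 200 else "unhealthy"
    except urllib.error.HTTPError:
        # The scheduler answered, but with a 4xx/5xx status.
        return "unhealthy"
    except OSError:
        # URLError, refused connections, DNS failures and timeouts all land here.
        return "unreachable"
```

For example, `health_status("http://clw-scheduler.default.svc.cluster.local:8080")` returning "unreachable" points at networking or a stopped service, while "unhealthy" suggests the process is up but failing its own checks.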
Step 2: Restart Scheduler Service
Before:
# Faulty Deployment (missing resources, wrong selector)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clw-scheduler
spec:
  selector:
    matchLabels:
      app: claw-worker # Wrong label!
  template:
    spec:
      containers:
      - name: scheduler
        image: openclaw/scheduler:v2.3.0
        ports:
        - containerPort: 8080
        resources: {} # No limits, OOM risk
After:
# Fixed Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clw-scheduler
  namespace: claw-system
spec:
  selector:
    matchLabels:
      app: clw-scheduler # Correct label
  template:
    metadata:
      labels:
        app: clw-scheduler
    spec:
      containers:
      - name: scheduler
        image: openclaw/scheduler:v2.4.1 # Updated version
        ports:
        - containerPort: 8080
        - containerPort: 9090
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 1
            memory: 2Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        env:
        - name: CLAW_SCHEDULER_BIND_ADDR
          value: "0.0.0.0:8080"
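Apply: kubectl apply -f scheduler-deployment.yaml && kubectl rollout status deployment/clw-scheduler -n claw-system
Note that spec.selector is immutable in apps/v1. If a Deployment with the old selector already exists in the same namespace, delete it first (kubectl delete deployment clw-scheduler -n claw-system) and then apply the fixed manifest.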
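The selector problem in the faulty manifest can be caught before applying. The sketch below checks that every spec.selector.matchLabels entry also appears in the pod template labels, which is what the Kubernetes API server requires of a Deployment. It works on a manifest already loaded into a plain dict (for example with a YAML parser of your choice) and ignores matchExpressions for brevity; `selector_matches_template` is a hypothetical helper.

```python
def selector_matches_template(deployment: dict) -> bool:
    """True when all matchLabels are present, with equal values, on the pod template."""
    spec = deployment.get("spec", {})
    selector = spec.get("selector", {}).get("matchLabels", {})
    labels = spec.get("template", {}).get("metadata", {}).get("labels", {})
    if not selector:
        return False  # apps/v1 requires a non-empty selector
    return all(labels.get(key) == value for key, value in selector.items())


# The two manifests from this step, reduced to the fields that matter here.
faulty = {"spec": {"selector": {"matchLabels": {"app": "claw-worker"}},
                   "template": {"spec": {}}}}
fixed = {"spec": {"selector": {"matchLabels": {"app": "clw-scheduler"}},
                  "template": {"metadata": {"labels": {"app": "clw-scheduler"}}}}}
```

Here `selector_matches_template(faulty)` is false and `selector_matches_template(fixed)` is true.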
Step 3: Fix Client Configuration
Before:
# ~/.claw/config.yaml (incorrect URL)
scheduler:
  url: "http://localhost:8080" # Wrong for cluster
  auth:
    token: "invalid-token"
After:
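# ~/.claw/config.yaml (cluster-correct)
scheduler:
  url: "http://clw-scheduler.claw-system.svc.cluster.local:8080"
  grpc_port: 9090
  auth:
    # YAML itself does not run shell substitutions; if clw does not expand $(...), paste the command's output here instead
    token: "$(kubectl get secret claw-auth -o jsonpath='{.data.token}' | base64 -d)"
  timeout: 30s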
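Set env (use the fully qualified name so the URL also resolves from other namespaces): export CLW_SCHEDULER_URL=http://clw-scheduler.claw-system.svc.cluster.local:8080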
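A quick sanity check can flag the localhost-in-cluster mistake before any command is run. The sketch below assumes that CLW_SCHEDULER_URL takes precedence over the config file value (the precedence is not stated in this guide) and detects a pod environment through KUBERNETES_SERVICE_HOST, which Kubernetes injects into every container. All function names are hypothetical.

```python
import os
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def running_in_cluster(env=os.environ) -> bool:
    # Kubernetes sets KUBERNETES_SERVICE_HOST in every pod's environment.
    return "KUBERNETES_SERVICE_HOST" in env


def resolve_scheduler_url(env, config_url):
    # Assumed precedence: the environment variable overrides the config file.
    return env.get("CLW_SCHEDULER_URL") or config_url


def points_at_loopback(url: str) -> bool:
    return urlsplit(url).hostname in LOOPBACK_HOSTS


def misconfigured(env, config_url) -> bool:
    """True when a client inside a cluster would dial its own loopback address."""
    url = resolve_scheduler_url(env, config_url)
    return bool(url) and running_in_cluster(env) and points_at_loopback(url)
```

Calling `misconfigured(os.environ, "http://localhost:8080")` inside a pod returns true unless the environment variable supplies a cluster address.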
Step 4: Check Network and Firewall
# Test connectivity from client pod
kubectl exec -it clw-client -- nc -zv clw-scheduler 8080
# Open firewall (ufw example)
sudo ufw allow 8080/tcp
sudo ufw allow 9090/tcp
# For Docker: ensure --network claw-net
docker network inspect claw-net
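The kind of failure narrows down the cause: a refused connection means the host answered but nothing listens on the port, a timeout usually means packets are being dropped (firewall or network policy), and a resolution error points at DNS. A minimal standard-library sketch that reports this per port follows; the port list comes from the Root Cause section, and `probe` is a hypothetical helper.

```python
import socket

# Ports named in the Root Cause section.
SCHEDULER_PORTS = {8080: "HTTP health", 9090: "gRPC", 8443: "TLS"}


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        return "dns-failure"   # hostname did not resolve
    except ConnectionRefusedError:
        return "refused"       # host reachable, nothing listening
    except (socket.timeout, TimeoutError):
        return "timeout"       # no answer, often a firewall drop
    except OSError:
        return "unreachable"   # e.g. no route to host


def probe_all(host: str, timeout: float = 3.0) -> dict:
    return {port: probe(host, port, timeout) for port in SCHEDULER_PORTS}
```

Running `probe_all("clw-scheduler")` from the client pod gives one result per port, which can be compared with the firewall rules opened above.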
Step 5: Scale and Monitor
kubectl scale deployment clw-scheduler --replicas=3 -n claw-system
kubectl autoscale deployment clw-scheduler --cpu-percent=70 --min=2 --max=5 -n claw-system
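To see what the autoscaler will do with these settings, the documented Horizontal Pod Autoscaler rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min and max. The sketch below applies that rule with the values from the command above; it deliberately ignores the controller's tolerance band and stabilization window, so real behavior can lag or differ slightly.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_percent: float,
                     target_cpu_percent: float = 70,
                     min_replicas: int = 2, max_replicas: int = 5) -> int:
    """Simplified HPA calculation: scale proportionally, then clamp."""
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))
```

For instance, three replicas averaging 140% of requested CPU would ask for six pods, which the --max=5 setting caps at five.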
4. Verification
Post-fix validation:
- CLI test:
  clw health # Should return: {"status": "healthy"}
  clw submit --dry-run job.yaml # No unreachable error
- Logs check:
  kubectl logs -l app=clw-scheduler -n claw-system --tail=50 | grep -i "server listening" # Expect: "gRPC server listening on 0.0.0.0:9090"
- End-to-end:
  clw submit job.yaml --wait
  clw status # Job in RUNNING or COMPLETED
- Metrics: Access the /metrics endpoint and confirm connection counts >0.
- Load test: Submit 10 jobs; monitor kubectl top pods -l app=clw-scheduler -n claw-system.
If issues persist, enable debug: clw --log-level debug status.
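After a rollout the scheduler may need some time before it answers, so a single check can give a false negative. The sketch below polls any check function until it succeeds or a deadline passes. It is generic on purpose; the clock and sleep functions are injectable so the loop can be tested without waiting, and `wait_until` is a hypothetical helper.

```python
import time


def wait_until(check, timeout: float = 60.0, interval: float = 2.0,
               clock=time.monotonic, sleep=time.sleep) -> bool:
    """Call check() repeatedly until it returns True or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

One possible use, assuming clw health exits non-zero while the scheduler is unreachable (not confirmed by this guide), is `wait_until(lambda: subprocess.run(["clw", "health"]).returncode == 0)` after importing subprocess.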
5. Common Pitfalls
- Namespace Mismatch: Clients in default querying claw-system services without full DNS. Fix: Use the fully qualified name (clw-scheduler.claw-system.svc.cluster.local), or kubectl port-forward svc/clw-scheduler 8080:8080 -n claw-system for local access.
- TLS Skip: Prod requires HTTPS; clients fail without --insecure. ⚠️ Unverified in v2.4.2.
- Init Container Hang: Scheduler waits for DB (PostgreSQL); add initContainers dependency check.
- Multi-Cluster: Wrong CLW_CLUSTER_ID leads to auth loops post-connection.
- Resource Limits Too Tight: 256Mi memory causes panics; bump to 1Gi min.
- Version Pinning: Docker Compose overrides image tag; pin an explicit tag such as image: openclaw/scheduler:v2.4.1 rather than a floating tag like latest.
- Service Type: ClusterIP vs LoadBalancer; external access needs NodePort/Ingress.
Avoid running kubectl delete pod on scheduler pods when no PodDisruptionBudget (PDB) or spare replica is in place; doing so causes downtime.
6. Related Errors
| Error Code | Description | Key Difference |
|---|---|---|
| clw-auth-failed | Token invalid post-connection | Hits scheduler but 401s |
| clw-job-queue-full | Scheduler reachable, queue backed up | 429 status |
| scheduler-overloaded | High load, slow responses | Timeouts, not refused |
Cross-reference clw-auth-failed guide for token workflows.
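The distinctions in the table can be expressed as a small decision function. This is a sketch of the triage logic only: it takes what a client observed (whether a connection was established, whether the request timed out, and the HTTP status if any) and returns the most likely error code from this page. The inputs and the function name `triage` are hypothetical and do not reflect how clw itself derives its error codes.

```python
def triage(connected: bool, http_status=None, timed_out: bool = False) -> str:
    """Map an observed failure to the most likely error code from the table."""
    if not connected:
        return "clw-scheduler-unreachable"   # never reached the scheduler
    if timed_out:
        return "scheduler-overloaded"        # reachable, but responses too slow
    if http_status == 401:
        return "clw-auth-failed"             # reached the scheduler, token rejected
    if http_status == 429:
        return "clw-job-queue-full"          # reached the scheduler, queue backed up
    return "unclassified"
```

A refused connection therefore maps to this guide, while a 401 or 429 means connectivity is fine and the related guide applies instead.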