## 1. Symptoms
The clw-router-disconnected error in OpenClaw manifests during distributed actor system operations, typically when routing messages across nodes. OpenClaw, a Rust-based actor framework for building scalable distributed systems, relies on a central router for message forwarding between actors on different nodes.
Common symptoms include:

- Application logs flooding with disconnection warnings:

  ```
  [2024-09-18T10:30:45Z ERROR claw::router] clw-router-disconnected: Router at 192.168.1.100:8080 has unexpectedly disconnected. Last heartbeat: 2024-09-18T10:30:30Z
  [2024-09-18T10:30:46Z WARN claw::network] Message routing failed for actor 'user-service-42': Router unavailable
  ```

- Actor message delivery failures, leading to timeouts:

  ```
  thread 'claw-runtime' panicked at 'RoutingError::RouterDisconnected { node_id: "node-001", reason: "Connection reset by peer" }'
  ```

- Node isolation: local actors function, but cross-node communication halts. Use the `claw status` CLI to observe:

  ```
  $ claw status
  Node: local-node
  - Connected peers: 0/3 (expected)
  - Router status: DISCONNECTED (last seen: 5m ago)
  - Active actors: 127/150
  ```

- Performance degradation: increased latency (>500ms) for routed messages, measurable via OpenClaw's built-in metrics endpoint (`/metrics`):

  ```
  claw_router_reconnects_total{node="local"} 15
  claw_router_heartbeat_failures_total{node="router"} 42
  ```
This error disrupts microservices or actor clusters, often after network hiccups or restarts.
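The metrics symptom above can be checked automatically. The following is a minimal sketch, not part of OpenClaw: it assumes the `/metrics` endpoint returns Prometheus text format without timestamps, and the helper names (`metric_value`, `router_unhealthy`) and the thresholds are hypothetical.

```rust
/// Return the value of the first sample named `name` in Prometheus
/// text-format output. Labels are ignored; comment lines are skipped.
fn metric_value(text: &str, name: &str) -> Option<f64> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .find_map(|line| {
            // A sample looks like `name{labels} value` or `name value`.
            let (series, value) = line.rsplit_once(char::is_whitespace)?;
            let series_name = series.split('{').next()?.trim();
            if series_name == name {
                value.parse::<f64>().ok()
            } else {
                None
            }
        })
}

/// Hypothetical rule: flag the router when either counter exceeds its limit.
/// A missing metric is treated as zero.
fn router_unhealthy(text: &str, max_reconnects: f64, max_heartbeat_failures: f64) -> bool {
    let reconnects = metric_value(text, "claw_router_reconnects_total").unwrap_or(0.0);
    let failures = metric_value(text, "claw_router_heartbeat_failures_total").unwrap_or(0.0);
    reconnects > max_reconnects || failures > max_heartbeat_failures
}

fn main() {
    // Sample taken from the symptom list above.
    let sample = "claw_router_reconnects_total{node=\"local\"} 15\n\
                  claw_router_heartbeat_failures_total{node=\"router\"} 42\n";
    // Thresholds (5 reconnects, 10 heartbeat failures) are illustrative only.
    println!("router unhealthy: {}", router_unhealthy(sample, 5.0, 10.0));
}
```

In practice the input text would come from an HTTP request to the metrics endpoint; the thresholds should be tuned to the cluster's normal reconnect rate.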
## 2. Root Cause
OpenClaw's router is a dedicated actor handling message routing via TCP/WebSocket connections with periodic heartbeats (default 30s interval). The `clw-router-disconnected` error triggers when:

- Network Interruptions: Firewalls, NAT timeouts, or packet loss sever the TCP connection. The connection is treated as failed after 3 missed heartbeat pings.
- Configuration Mismatch: Router bind address/port differs from client node configs. Example: the router binds to `127.0.0.1:8080`, but nodes target `0.0.0.0:8080`.
- Resource Exhaustion: The router node hits ulimits (e.g., file descriptors <1024), OOM kills, or CPU throttling, causing abrupt closes.
- Version Incompatibility: Mismatched OpenClaw versions between router and nodes (e.g., v0.5.2 router with v0.6.0 nodes) break protocol negotiation.
- TLS/Proxy Issues: Misconfigured mTLS certs or reverse proxies dropping idle connections.
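To make the heartbeat rule concrete, here is a small sketch of a missed-heartbeat detector. It is an illustration only and does not reproduce OpenClaw's internals: the `HeartbeatMonitor` type is hypothetical, and it assumes a "missed" heartbeat means one whole interval elapsed without an acknowledgment.

```rust
use std::time::{Duration, Instant};

/// Tracks when the last heartbeat acknowledgment arrived.
struct HeartbeatMonitor {
    interval: Duration,
    max_missed: u32,
    last_ack: Instant,
}

impl HeartbeatMonitor {
    fn new(interval: Duration, max_missed: u32, now: Instant) -> Self {
        Self { interval, max_missed, last_ack: now }
    }

    /// Record an acknowledgment received at `now`.
    fn record_ack(&mut self, now: Instant) {
        self.last_ack = now;
    }

    /// Number of whole heartbeat intervals elapsed since the last ack.
    fn missed(&self, now: Instant) -> u32 {
        let elapsed = now.saturating_duration_since(self.last_ack);
        (elapsed.as_millis() / self.interval.as_millis().max(1)) as u32
    }

    /// True once the number of missed intervals reaches the limit.
    fn is_disconnected(&self, now: Instant) -> bool {
        self.missed(now) >= self.max_missed
    }
}

fn main() {
    let start = Instant::now();
    // 30s interval and 3 missed pings, as described above.
    let monitor = HeartbeatMonitor::new(Duration::from_secs(30), 3, start);
    for secs in [30u64, 60, 90] {
        let now = start + Duration::from_secs(secs);
        println!(
            "t+{}s: missed={} disconnected={}",
            secs,
            monitor.missed(now),
            monitor.is_disconnected(now)
        );
    }
}
```

Passing `now` explicitly keeps the logic deterministic, so the threshold can be exercised without waiting in real time.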
Inspect logs for precursors:

```
[DEBUG claw::net] Heartbeat sent to [email protected]:8080
[ERROR claw::net] Heartbeat timeout after 35s
```

Use `netstat` or `ss` to confirm that the router is listening:

```bash
$ ss -tuln | grep 8080
tcp LISTEN 0 128 0.0.0.0:8080 0.0.0.0:*
```
Root cause is almost always connectivity or config drift in production clusters.
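Config drift of the kind described above can be caught before deployment with a simple consistency check. The sketch below is an assumption-laden illustration: `check_router_config` and `ConfigIssue` are hypothetical names, the inputs are plain IP literals (no hostnames), and the three rules reflect only the mismatches mentioned in this section.

```rust
use std::net::{AddrParseError, IpAddr, SocketAddr};

/// Problems that would keep a node from reaching the router.
#[derive(Debug, PartialEq)]
enum ConfigIssue {
    /// The node targets a different port than the router listens on.
    PortMismatch { router_port: u16, node_port: u16 },
    /// The router listens on loopback only, but the node connects remotely.
    LoopbackBind,
    /// The node targets the unspecified address (0.0.0.0 or ::).
    UnspecifiedTarget,
}

/// Compare the router's bind address and port with the address a node targets.
fn check_router_config(
    bind_addr: &str,
    port: u16,
    node_router_addr: &str,
) -> Result<Vec<ConfigIssue>, AddrParseError> {
    let bind_ip: IpAddr = bind_addr.parse()?;
    let target: SocketAddr = node_router_addr.parse()?;

    let mut issues = Vec::new();
    if target.port() != port {
        issues.push(ConfigIssue::PortMismatch {
            router_port: port,
            node_port: target.port(),
        });
    }
    if bind_ip.is_loopback() && !target.ip().is_loopback() {
        issues.push(ConfigIssue::LoopbackBind);
    }
    if target.ip().is_unspecified() {
        issues.push(ConfigIssue::UnspecifiedTarget);
    }
    Ok(issues)
}

fn main() {
    // Router bound to loopback on 8080, node pointing at a remote IP on 8081.
    match check_router_config("127.0.0.1", 8080, "192.168.1.100:8081") {
        Ok(issues) if issues.is_empty() => println!("configs are consistent"),
        Ok(issues) => println!("config issues: {:?}", issues),
        Err(e) => println!("invalid address: {}", e),
    }
}
```

A check like this could run in CI against the values read from the router and node config files.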
## 3. Step-by-Step Fix
Fix `clw-router-disconnected` by verifying connectivity, aligning configs, and enabling reconnections. Assume OpenClaw v0.6.x on Rust 1.75+.
### Step 1: Verify Router Availability
Ping the router node and check port:
```bash
$ nc -zv 192.168.1.100 8080
Connection to 192.168.1.100 8080 port [tcp/*] succeeded!
```
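The same reachability probe can be run from Rust, for example as a startup pre-check. This is a sketch using only the standard library; `router_reachable` is a hypothetical helper, it accepts IP literals only, and a successful TCP connect says nothing about whether the OpenClaw protocol handshake would succeed.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Return true if a TCP connection to `addr` succeeds within `timeout`.
/// Unparseable addresses are reported as unreachable.
fn router_reachable(addr: &str, timeout: Duration) -> bool {
    match addr.parse::<SocketAddr>() {
        Ok(sock) => TcpStream::connect_timeout(&sock, timeout).is_ok(),
        Err(_) => false,
    }
}

fn main() {
    // Router address from the example above.
    let addr = "192.168.1.100:8080";
    if router_reachable(addr, Duration::from_secs(3)) {
        println!("router at {} accepts TCP connections", addr);
    } else {
        println!("router at {} is unreachable", addr);
    }
}
```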
### Step 2: Align Configurations
**Before:** Mismatched router config (`router.toml`):

```toml
[router]
bind_addr = "127.0.0.1"
port = 8080
heartbeat_interval = "30s"
max_reconnect_attempts = 3
```

Node config (`node.toml`):

```toml
[network]
router_addr = "192.168.1.100:8081" # Wrong port!
tls_enabled = false
```

**After:** Unified configs (`router.toml`):

```toml
[router]
bind_addr = "0.0.0.0"
port = 8080
heartbeat_interval = "20s"
max_reconnect_attempts = 10
idle_timeout = "5m"
```

Node config (`node.toml`):

```toml
[network]
router_addr = "192.168.1.100:8080"
heartbeat_interval = "20s"
max_reconnect_attempts = 10
reconnect_backoff = "1s,2s,5s"
tls_enabled = false # Match router
```
### Step 3: Update Rust Code for Robust Reconnection
**Before:** Basic actor spawn without retry logic:

```rust
use claw::{Actor, System};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let sys = System::new("local-node").await?;
    let router = sys.router("192.168.1.100:8080").await?; // Single connect, no retry
    let actor = MyActor::new().start(&sys).await?;
    sys.run().await?;
    Ok(())
}
```
**After:** With exponential backoff and health checks:

```rust
use anyhow::Result;
use claw::{Actor, System, RouterRef};
use tokio::time::{interval, Duration};

#[tokio::main]
async fn main() -> Result<()> {
    let sys = System::new("local-node").await?;
    let mut router = connect_router_with_retry(&sys, "192.168.1.100:8080").await?;

    // Periodic health check: reconnect whenever the router stops responding.
    let mut health_interval = interval(Duration::from_secs(10));
    loop {
        health_interval.tick().await;
        if let Err(e) = router.health_check().await {
            log::warn!("Router health check failed: {}, reconnecting...", e);
            router = connect_router_with_retry(&sys, "192.168.1.100:8080").await?;
        }
    }
}

/// Connects to the router, doubling the delay after each failure (capped at 60s).
async fn connect_router_with_retry(sys: &System, addr: &str) -> Result<RouterRef> {
    let mut backoff = Duration::from_secs(1);
    loop {
        match sys.router(addr).await {
            Ok(router) => match router.ping().await {
                // Only hand back a connection that answers a ping.
                Ok(_) => return Ok(router),
                Err(e) => log::error!("Router ping failed: {}, retrying in {:?}", e, backoff),
            },
            Err(e) => log::error!("Router connect failed: {}, retrying in {:?}", e, backoff),
        }
        tokio::time::sleep(backoff).await;
        backoff = (backoff * 2).min(Duration::from_secs(60));
    }
}
```
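To see what the doubling-with-cap policy means in wall-clock terms, the delays can be computed up front. This is a standalone sketch using only the standard library; `backoff_schedule` and `total_wait` are hypothetical helpers, and combining the policy with a fixed attempt count of 10 (the `max_reconnect_attempts` value from Step 2) is an assumption made for illustration.

```rust
use std::time::Duration;

/// Delays for `attempts` retries: start at `initial`, double each time,
/// never exceed `cap`.
fn backoff_schedule(initial: Duration, cap: Duration, attempts: usize) -> Vec<Duration> {
    let mut delays = Vec::with_capacity(attempts);
    let mut delay = initial.min(cap);
    for _ in 0..attempts {
        delays.push(delay);
        delay = delay.saturating_mul(2).min(cap);
    }
    delays
}

/// Total time spent sleeping if every attempt in the schedule fails.
fn total_wait(schedule: &[Duration]) -> Duration {
    schedule.iter().sum()
}

fn main() {
    // 1s initial delay and 60s cap, matching the retry loop above.
    let schedule = backoff_schedule(Duration::from_secs(1), Duration::from_secs(60), 10);
    let secs: Vec<u64> = schedule.iter().map(Duration::as_secs).collect();
    println!("delays (s): {:?}", secs);
    println!("total wait: {:?}", total_wait(&schedule));
}
```

Under these assumptions, ten failed attempts add up to 303 seconds of waiting, which is worth comparing against any idle timeout or orchestrator restart deadline. Adding random jitter to each delay is a common refinement when many nodes reconnect at once.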
### Step 4: Restart Services

```bash
$ claw stop --all
$ claw start router --config router.toml
$ claw start node --config node.toml
```

### Step 5: Enable TLS if Needed (Optional)
Generate certs with `claw cert gen` and set `tls_enabled = true`.
## 4. Verification
Confirm the fix with these checks:

- CLI status:

  ```
  $ claw status
  Node: local-node
  - Connected peers: 3/3
  - Router status: CONNECTED (heartbeat: 2s ago)
  ```

- Logs: no errors after 5 minutes of runtime.
- Metrics:

  ```bash
  $ curl http://localhost:9090/metrics | grep claw_router
  claw_router_reconnects_total{node="local"} 0 # No recent reconnects
  claw_router_heartbeat_latency_seconds{quantile="0.99"} 0.05
  ```

- End-to-end test: spawn actors and route 1000 messages:

  ```rust
  // Test script
  for _ in 0..1000 {
      router.tell("remote-actor", Msg::Ping).await?;
  }
  ```

  Expect 0 failures.

Simulate load with `claw bench router --duration 60s`.
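The end-to-end check is easier to judge with a summary of failures and tail latency. The sketch below is independent of OpenClaw: it assumes each send has been recorded as either a round-trip duration or an error string, `summarize` and `passes` are hypothetical helpers, and the 500ms limit is borrowed from the latency symptom in section 1.

```rust
use std::time::Duration;

/// Outcome of a batch of routed messages.
struct RunSummary {
    sent: usize,
    failed: usize,
    /// 99th percentile latency of successful sends (nearest-rank method).
    p99: Option<Duration>,
}

/// Summarize per-message results: `Ok(latency)` or `Err(reason)`.
fn summarize(results: &[Result<Duration, String>]) -> RunSummary {
    let mut ok: Vec<Duration> = results
        .iter()
        .filter_map(|r| r.as_ref().ok().copied())
        .collect();
    ok.sort();
    let p99 = if ok.is_empty() {
        None
    } else {
        // Nearest-rank percentile: 1-based rank = ceil(0.99 * n).
        let rank = (ok.len() * 99 + 99) / 100;
        Some(ok[rank - 1])
    };
    RunSummary {
        sent: results.len(),
        failed: results.len() - ok.len(),
        p99,
    }
}

/// Pass only if nothing failed and p99 latency is within `max_p99`.
fn passes(summary: &RunSummary, max_p99: Duration) -> bool {
    summary.failed == 0 && summary.p99.map_or(false, |p| p <= max_p99)
}

fn main() {
    // Illustrative data: 999 fast sends and one failure.
    let mut results: Vec<Result<Duration, String>> =
        (0..999).map(|_| Ok(Duration::from_millis(20))).collect();
    results.push(Err("Router unavailable".to_string()));

    let summary = summarize(&results);
    println!(
        "sent={} failed={} p99={:?} pass={}",
        summary.sent,
        summary.failed,
        summary.p99,
        passes(&summary, Duration::from_millis(500))
    );
}
```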
## 5. Common Pitfalls
- Ignoring Backoff: Setting `max_reconnect_attempts=1` causes thundering herd on restarts.
- Localhost Binding: `bind_addr="127.0.0.1"` blocks remote nodes; use `0.0.0.0`.
- Firewall Rules: `ufw allow 8080` or `firewall-cmd --add-port=8080/tcp`.
- Ulimits: `ulimit -n 65536` before starting the router.
- Docker Networking: Use `--network host` or expose ports correctly.
- Version Lock: Pin `claw = "0.6.2"` in `Cargo.toml` across the cluster.
- Proxy Interference: NGINX `proxy_read_timeout 300s;` for long heartbeats.
- ⚠️ Unverified: IPv6 mismatches in mixed environments; test explicitly.
## 6. Related Errors
| Error Code | Description | Fix Summary |
|---|---|---|
| clw-connection-timeout | Connection init times out | Increase connect_timeout, check DNS |
| clw-heartbeat-failed | Missed heartbeats without full disconnect | Tune heartbeat_interval, network MTU |
| clw-network-partition | Multi-node split-brain | Enable gossip protocol, leader election |
These often cascade from clw-router-disconnected. Monitor with Prometheus exporter.