Fix clw-worker-disconnected: Worker Connection Lost in OpenClaw

1. Symptoms

The clw-worker-disconnected error manifests when the OpenClaw worker process loses its connection to the CloudWatch Logs service or the central coordination server. Users typically encounter this error during active log streaming operations or when attempting to initiate a new worker instance.

Typical Error Output:

[ERROR] clw-worker-disconnected: Worker instance worker-abc123 has been disconnected from the CloudWatch endpoint
[ERROR] Failed to send log batch: connection reset by peer
[WARN] Retrying connection in 5 seconds...
[FATAL] Maximum reconnection attempts (3) exceeded

Observable Symptoms:

Log entries cease to appear in CloudWatch Logs despite the agent running
The worker process enters a reconnection loop with repeated failure messages
Health check endpoints report the worker as unavailable
The OpenClaw dashboard shows the worker in a “disconnected” state with a red status indicator
Memory and CPU usage may spike while the process tries to re-establish the session
Pending log buffers begin to accumulate without being flushed to the destination

In containerized environments, the container may continue running but report unhealthy status through orchestration health checks. The error frequently occurs after network interruptions, infrastructure maintenance events, or when IAM credentials expire and cannot be automatically refreshed.

2. Root Cause

The clw-worker-disconnected error stems from several potential underlying causes, each related to the connection lifecycle between the OpenClaw worker and its target endpoint.

Primary Root Causes:

Network Interruption or Timeout: The most common cause is a TCP connection that is terminated unexpectedly because of a network path failure, a firewall rule change, or a long idle period that triggers an idle connection timeout. An idle connection can be dropped silently by the service or by an intermediate device such as a NAT gateway or load balancer, so the worker only discovers the disconnected state on its next write attempt.

IAM Credential Expiration: OpenClaw workers rely on AWS credentials (whether from instance profiles, task roles, or environment variables) to authenticate with CloudWatch. When these credentials expire and automatic rotation fails, the worker cannot maintain its session and is disconnected. This is particularly problematic in long-running worker instances where credential refresh mechanisms are not properly implemented.

Worker Process Crash or Termination: Underlying issues such as out-of-memory conditions, segmentation faults, or deliberate process termination (due to deployment scripts or orchestrator restarts) can cause the worker to disconnect abruptly. If graceful shutdown procedures are not followed, pending log batches may be lost and the connection state becomes inconsistent.

Endpoint Configuration Mismatch: When worker configurations point to incorrect or unreachable CloudWatch endpoints (such as wrong region endpoints or load balancer addresses), connection attempts fail repeatedly. This commonly occurs after infrastructure migrations or when workers are deployed with outdated configuration files.

CloudWatch Service Disruption: Temporary service-side issues or rate limiting can cause CloudWatch to reject connections, resulting in worker disconnection. While typically transient, these events can cascade if the reconnection logic does not implement proper backoff strategies.

3. Step-by-Step Fix

Step 1: Verify Network Connectivity

Begin by confirming that the worker host can reach the CloudWatch endpoint. This diagnostic step identifies whether the issue originates from network routing or firewall policies.

Before:

# Attempt to reach CloudWatch endpoint without verifying connectivity
./openclaw-worker start --config /etc/openclaw/worker.yaml

After:

# Test TCP connectivity to CloudWatch Logs endpoint
nc -zv logs.us-east-1.amazonaws.com 443
telnet logs.us-east-1.amazonaws.com 443
curl -v https://logs.us-east-1.amazonaws.com/

# Verify DNS resolution
nslookup logs.us-east-1.amazonaws.com
dig logs.us-east-1.amazonaws.com

If connectivity tests fail, review security group rules, network ACLs, and any intermediate firewall configurations to ensure outbound traffic on port 443 is permitted.
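
The same checks can be scripted so that they run before the worker starts. The following is a minimal Python sketch using only the standard library. It is not part of OpenClaw: the function name probe_endpoint and its return format are illustrative assumptions.

```python
import socket
import ssl


def probe_endpoint(host, port=443, timeout=5.0, use_tls=True):
    """Run DNS, TCP, and optional TLS checks; return a list of (stage, ok, detail)."""
    results = []

    # Stage 1: DNS resolution.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        results.append(("dns", True, infos[0][4][0]))
    except socket.gaierror as exc:
        results.append(("dns", False, str(exc)))
        return results

    # Stage 2: TCP connection.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        results.append(("tcp", False, str(exc)))
        return results
    results.append(("tcp", True, "connected"))

    # Stage 3: TLS handshake with certificate and hostname verification.
    try:
        if use_tls:
            context = ssl.create_default_context()
            with context.wrap_socket(sock, server_hostname=host) as tls:
                results.append(("tls", True, tls.version()))
        else:
            sock.close()
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        results.append(("tls", False, str(exc)))
        sock.close()
    return results
```

Calling probe_endpoint("logs.us-east-1.amazonaws.com") returns one tuple per stage reached, and the first failed stage shows whether to look at DNS, routing and firewall rules, or TLS interception.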

Step 2: Validate and Refresh AWS Credentials

Ensure that the credentials used by the OpenClaw worker are valid and have appropriate permissions. Check the IAM role attached to the instance or the task role if running in ECS or EKS.

Before:

# worker.yaml - potentially stale or incomplete credentials section
credentials:
  access_key_id: AKIAIOSFODNN7EXAMPLE
  secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

After:

# worker.yaml - use instance/task role or explicitly configured credentials with rotation
credentials:
  use_instance_role: true
  # Or configure credential provider chain for automatic refresh
  credential_refresh_interval: 300

# If using explicit credentials, ensure they are rotated regularly
# Consider using AWS Secrets Manager for credential management

Validate credentials:

# Test AWS CLI access with the same credentials
aws sts get-caller-identity --region us-east-1

# If using instance role
aws sts get-caller-identity
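
Temporary credentials from instance profiles and task roles carry an expiration timestamp in ISO 8601 UTC format. The sketch below shows one way to decide when a refresh is due. The function name needs_refresh is hypothetical, and the 300-second margin is an assumption chosen to mirror credential_refresh_interval above.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(expiration, now=None, margin_seconds=300):
    """Return True if credentials expire within margin_seconds of now.

    expiration is an ISO 8601 UTC timestamp such as "2025-01-15T10:32:45Z".
    """
    # datetime.fromisoformat() accepts a trailing "Z" only on Python 3.11+,
    # so normalize it to an explicit UTC offset first.
    expires_at = datetime.fromisoformat(expiration.replace("Z", "+00:00"))
    if now is None:
        now = datetime.now(timezone.utc)
    return expires_at - now <= timedelta(seconds=margin_seconds)
```

Refreshing before the expiration, rather than after the first failed request, avoids a window in which queued log batches are rejected.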

Step 3: Check and Update Worker Configuration

Review the OpenClaw worker configuration for endpoint correctness, buffer settings, and reconnection parameters.

Before:

# worker.yaml - problematic configuration
cloudwatch:
  endpoint: https://logs.us-east-1.amazonaws.com
  log_group: /aws/ec2/production
  stream_name: instance-1
  buffer:
    size: 100
    flush_interval: 1
  connection:
    max_retries: 3
    retry_backoff: 1

After:

# worker.yaml - corrected configuration with proper retry logic
cloudwatch:
  endpoint: https://logs.us-east-1.amazonaws.com
  log_group: /aws/ec2/production
  stream_name: instance-1
  buffer:
    size: 1000
    flush_interval: 5
    max_retention: 10000
  connection:
    max_retries: 10
    retry_backoff: 5
    max_retry_interval: 60
    heartbeat_interval: 30
    idle_timeout: 300
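
The relationships between these values matter as much as the values themselves. The following Python sketch checks a configuration that has already been loaded into a dictionary. The function name and the interpretation of each key are assumptions made for illustration (for example, that max_retention is the largest number of entries held while disconnected); consult the OpenClaw documentation for the authoritative semantics.

```python
def check_connection_settings(config):
    """Return a list of human-readable problems found in a worker config dict."""
    problems = []
    cw = config.get("cloudwatch", {})
    buf = cw.get("buffer", {})
    conn = cw.get("connection", {})
    unlimited = float("inf")

    # A heartbeat keeps a connection alive only if it fires before the idle timeout.
    if conn.get("heartbeat_interval", 0) >= conn.get("idle_timeout", unlimited):
        problems.append("heartbeat_interval should be shorter than idle_timeout")

    # The first retry delay cannot sensibly exceed the maximum retry delay.
    if conn.get("retry_backoff", 0) > conn.get("max_retry_interval", unlimited):
        problems.append("retry_backoff should not exceed max_retry_interval")

    # At least one retry is needed to survive a transient failure.
    if conn.get("max_retries", 1) < 1:
        problems.append("max_retries should be at least 1")

    # Assumed meaning: max_retention bounds the backlog, so it must fit one batch.
    if buf.get("size", 0) > buf.get("max_retention", unlimited):
        problems.append("buffer.size should not exceed buffer.max_retention")

    return problems
```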

Step 4: Restart the OpenClaw Worker Service

After correcting the configuration and ensuring credentials are valid, restart the worker service to establish a fresh connection.

# Stop the current worker process gracefully
sudo systemctl stop openclaw-worker

# Verify the process has stopped
sudo systemctl status openclaw-worker

# Clear any stale state files
sudo rm -f /var/lib/openclaw/worker.state
sudo rm -f /var/log/openclaw/worker.reconnect.log

# Start the worker with updated configuration
sudo systemctl start openclaw-worker

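# Ensure the service starts at boot (automatic restart after a crash is
# controlled separately by Restart=on-failure in the systemd unit file)
sudo systemctl enable openclaw-worker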

# Monitor the startup logs
sudo journalctl -u openclaw-worker -f

Step 5: Implement Health Monitoring

Configure external health checks to detect disconnection events proactively and trigger automatic recovery.

# health-monitor.yaml
health_checks:
  - name: worker_connectivity
    type: http
    endpoint: http://localhost:9000/health
    interval: 30
    timeout: 10
    failure_threshold: 3
    action: restart_worker
    
  - name: cloudwatch_write
    type: log_check
    pattern: "clw-worker-disconnected"
    alert_threshold: 1
    action: page_oncall
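
The failure_threshold setting implies counting consecutive failures, so that a single slow response does not trigger a restart. A minimal Python sketch of that logic follows. The class name FailureTracker is hypothetical, and firing only once per outage is an assumed design choice, not documented OpenClaw behavior.

```python
class FailureTracker:
    """Signal an action once after `threshold` consecutive failed checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed):
        """Record one check result; return True when the action should fire."""
        if check_passed:
            # Any success resets the streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire exactly once, on the check that reaches the threshold.
        return self.consecutive_failures == self.threshold
```

Resetting the counter on every successful check means intermittent failures never accumulate into a restart, which keeps the recovery action from masking a flapping network.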

4. Verification

After implementing the fix, verify that the worker has successfully reconnected and is processing logs correctly.

Check Worker Status:

# Query the worker health endpoint
curl -s http://localhost:9000/health | jq .

# Expected output:
{
  "status": "healthy",
  "connected": true,
  "endpoint": "logs.us-east-1.amazonaws.com",
  "uptime_seconds": 847,
  "logs_sent": 15832,
  "last_successful_flush": "2025-01-15T10:32:45Z"
}
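
A script can evaluate this response instead of a person reading it. The sketch below assumes the payload has exactly the fields shown in the expected output above. The function name and the 60-second freshness limit are illustrative assumptions.

```python
import json
from datetime import datetime, timezone


def is_worker_healthy(payload, now=None, max_flush_age_seconds=60):
    """Return True if the health JSON reports a connected, recently flushed worker."""
    data = json.loads(payload)
    if data.get("status") != "healthy" or data.get("connected") is not True:
        return False

    last_flush = data.get("last_successful_flush")
    if last_flush is None:
        return False

    # Normalize the trailing "Z" so fromisoformat() works before Python 3.11.
    flushed_at = datetime.fromisoformat(last_flush.replace("Z", "+00:00"))
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - flushed_at).total_seconds() <= max_flush_age_seconds
```

Checking the age of last_successful_flush catches the case where the worker reports itself connected while log batches have stopped flowing.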

Verify Log Ingestion:

# Check that logs are appearing in CloudWatch
aws logs tail /aws/ec2/production --filter-pattern "openclaw" --since 5m

# Monitor for any new errors
sudo journalctl -u openclaw-worker --since "10 minutes ago" | grep -i error

# Verify connection stability over time
watch -n 10 'curl -s http://localhost:9000/metrics | grep worker_state'

Run Connectivity Tests:

# Execute the built-in connectivity diagnostic
./openclaw-worker diagnose --endpoint https://logs.us-east-1.amazonaws.com

# Expected output:
[OK] DNS resolution: logs.us-east-1.amazonaws.com resolves to 52.94.236.0
[OK] TCP connection: Established to logs.us-east-1.amazonaws.com:443
[OK] TLS handshake: Completed successfully
[OK] Authentication: Valid credentials verified
[OK] Log stream: Accessible and writable
[SUCCESS] All connectivity checks passed

Confirm No Recurrence:

# Monitor logs for 15 minutes to ensure no disconnection events
# (-i also matches upper-case log levels such as [ERROR])
timeout 15m tail -f /var/log/openclaw/worker.log | grep -iE "(disconnected|reconnect|error)"

# Check system metrics for stable operation
curl http://localhost:9000/metrics | grep -E "(memory_usage|cpu_percent|buffer_size)"
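
For unattended verification, the keyword scan can be run over captured log lines and turned into a pass or fail result. This Python sketch matches the same three keywords case-insensitively. The function names are hypothetical, and the sample lines in the usage note are taken from the error output in section 1.

```python
import re

# Same keywords as the grep check, matched case-insensitively so that
# upper-case log levels such as [ERROR] are also caught.
EVENT_PATTERN = re.compile(r"disconnected|reconnect|error", re.IGNORECASE)


def find_events(lines):
    """Return the log lines that mention a disconnection-related keyword."""
    return [line for line in lines if EVENT_PATTERN.search(line)]


def is_stable(lines):
    """Return True if no line in the observation window matches."""
    return not find_events(lines)
```

Applied to the four lines of the typical error output, find_events returns three of them: the "[WARN] Retrying connection" line contains none of the keywords.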

5. Common Pitfalls

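Ignoring Credential Expiration: A frequent mistake is configuring static AWS access keys without implementing rotation. Long-lived access keys do not expire on their own, but they stop working as soon as they are rotated, deactivated, or deleted, and temporary session credentials expire on a fixed schedule; in both cases a worker still holding the old values is disconnected repeatedly. For production environments, prefer instance roles or task roles, which refresh credentials automatically, or a service such as AWS Secrets Manager with automatic rotation.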

Misconfiguring Retry Parameters: Setting max_retries to a low value (such as 3) combined with short retry_backoff intervals can cause the worker to exhaust retry attempts before transient network issues resolve. The recommended approach uses exponential backoff with jitter and higher retry limits to handle variable-duration network interruptions gracefully.
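
As an illustration of exponential backoff with jitter, the sketch below computes the delay before each retry using the "full jitter" variant, in which the delay is drawn uniformly between zero and a capped exponential ceiling. This is one common formulation, not necessarily what OpenClaw implements, and the function name and default values are assumptions.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a sleep time in seconds for the given zero-based retry attempt.

    The ceiling doubles with each attempt until it reaches `cap`; the actual
    delay is a uniform random value below that ceiling so that many workers
    recovering from the same outage do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

With base=5 and cap=60, the ceiling grows through 5, 10, 20, and 40 seconds and then stays at 60, so ten retries can span several minutes of outage rather than a few seconds.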

Insufficient Buffer Configuration: Workers configured with small buffers (around 100 entries or fewer) may lose logs during brief disconnection windows. Conversely, excessively large buffers consume memory and risk losing more data during crashes. The optimal configuration depends on log volume, but buffers of 1,000 to 5,000 entries with appropriate flush intervals provide a good balance.
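
The trade-off can be made visible by bounding the buffer and counting what it discards. The following Python sketch is illustrative only: the class name BoundedLogBuffer is hypothetical, and dropping the oldest entry when full is an assumed policy, not documented OpenClaw behavior.

```python
from collections import deque


class BoundedLogBuffer:
    """Hold at most max_entries log entries, dropping the oldest when full."""

    def __init__(self, max_entries=1000):
        self._entries = deque(maxlen=max_entries)
        self.dropped = 0

    def append(self, entry):
        # A full deque silently evicts its oldest item, so count the loss first.
        if len(self._entries) == self._entries.maxlen:
            self.dropped += 1
        self._entries.append(entry)

    def drain(self):
        """Return all buffered entries in order and empty the buffer."""
        batch = list(self._entries)
        self._entries.clear()
        return batch

    def __len__(self):
        return len(self._entries)
```

Exposing the dropped counter as a metric shows whether the configured size is too small for the disconnection windows actually observed.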

Neglecting Graceful Shutdown: Forcing worker termination with kill -9 or abrupt container stops prevents proper connection cleanup and state persistence. Always use graceful shutdown signals (SIGTERM) and ensure shutdown hooks complete before process termination. This preserves pending log batches and connection state.
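
A minimal Python sketch of a SIGTERM-aware loop is shown here to illustrate the pattern. It does not describe OpenClaw's internals: the class name GracefulWorker and the flush callback are hypothetical stand-ins for whatever sends the pending log batch.

```python
import signal
import threading


class GracefulWorker:
    """Flush periodically, and once more after a stop is requested."""

    def __init__(self, flush):
        self._flush = flush
        self._stop = threading.Event()

    def install_handlers(self):
        # Must be called from the main thread; SIGKILL (kill -9) cannot be caught.
        signal.signal(signal.SIGTERM, self.request_stop)
        signal.signal(signal.SIGINT, self.request_stop)

    def request_stop(self, signum=None, frame=None):
        # Only set a flag here; the real work happens in the main loop.
        self._stop.set()

    def run(self, poll_seconds=1.0):
        # wait() returns False on timeout and True once a stop was requested.
        while not self._stop.wait(poll_seconds):
            self._flush()
        # Final flush so pending entries are not lost on shutdown.
        self._flush()
```

Keeping the signal handler down to setting a flag avoids running network I/O inside the handler itself, and the final flush is the step that kill -9 skips.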

Overlooking Security Group Changes: After security group rule modifications, existing connections may remain established while new connections fail. Always restart the worker service after security group changes to ensure connections are re-established with updated firewall rules in effect.

Using Wrong Endpoint Regions: Specifying an incorrect AWS region in the endpoint configuration results in consistent connection failures. Verify that the configured region matches the region where your CloudWatch log groups are located, and use the regional endpoint for that region (for example, logs.us-east-1.amazonaws.com, as in the configuration in Step 3).

6. Related Errors

clw-connection-timeout: This error occurs when the OpenClaw worker cannot establish a connection to the CloudWatch endpoint within the specified timeout period. Unlike the disconnection error, which affects established connections, this error prevents connections from being established at all. It typically results from incorrect endpoint URLs, DNS resolution failures, or aggressive connection timeout settings. Resolution involves verifying endpoint configuration, increasing timeout values, and confirming network routing.

clw-heartbeat-failure: The heartbeat failure error indicates that the worker failed to respond to periodic health check requests or failed to send heartbeat signals to the coordination server. This error often precedes disconnection events, because the monitoring system detects the worker's unresponsive state first. Causes include process CPU saturation, garbage collection pauses, or network latency affecting heartbeat transmission. Mitigations include lengthening the heartbeat interval and ensuring the worker has adequate system resources.

clw-agent-crash: This critical error occurs when the OpenClaw agent process terminates unexpectedly due to unhandled exceptions, segmentation faults, or out-of-memory conditions. Unlike disconnection errors which may resolve automatically, agent crashes require manual intervention to restart the process and investigate the root cause. Enable core dump collection, review crash logs, and ensure adequate memory allocation to prevent recurrence.
