# Fix clw-agent-limit-exceeded: Resolve OpenClaw maximum agent count exceeded error

## 1. Symptoms

The clw-agent-limit-exceeded error in OpenClaw manifests when the framework detects that the number of active agents surpasses the configured maximum quota. This typically halts new agent spawns and logs a critical error, disrupting workflows reliant on dynamic agent scaling, such as web crawling, data processing pipelines, or AI task distribution.

Common symptoms include:

```
[2024-10-18 14:32:15] ERROR [clw-orchestrator] clw-agent-limit-exceeded: Maximum agents (100) reached. Active: 101/100. Rejecting spawn request for task-id: crawl-uuid-1234.
[2024-10-18 14:32:15] WARN  [clw-orchestrator] Task queue backlog increasing: 50 pending tasks stalled.
```


Agents in `PENDING` or `FAILED` states accumulate without progression:

```
$ clw status agents
AGENT_ID   STATUS   TASKS  NODE
agent-001  RUNNING  5      node-1
…
agent-100  RUNNING  3      node-3
agent-101  PENDING  1      -      (rejected: limit exceeded)
Total active: 101/100
```


Performance degradation follows: CPU utilization spikes on orchestrator nodes due to repeated spawn retries, and task latency climbs sharply. In Kubernetes deployments, this triggers pod evictions if resource requests are unmet. Docker users see containers stuck in restart loops after `healthcheck` failures.

Logs are often accompanied by stack traces pointing to `ClwOrchestrator::spawnAgent()` in the OpenClaw core library (libclw.so), with errno-like codes (e.g., `CLW_ERR_QUOTA=0xE105`).

## 2. Root Cause

OpenClaw enforces strict per-cluster or per-node agent limits to prevent resource exhaustion in distributed environments. The `clw-agent-limit-exceeded` error stems from:

1. **Configuration Caps**: Default `agent_max` in `clw-config.yaml` is 100 agents cluster-wide (or per-node in standalone mode). Exceeding this triggers quota checks in the orchestrator's `AgentPoolManager`.

2. **Scaling Mismatch**: Rapid task influx (e.g., from `clw submit --parallel 200`) overwhelms static limits without horizontal scaling.

3. **Zombie Agents**: Leaked agents from crashes or improper shutdowns (`clw agent stop --force` omitted) inflate counts without cleanup.

4. **License/Edition Limits**: Community edition caps at 50 agents; Enterprise allows 10k+ but requires key validation.

5. **Cluster Imbalance**: In multi-node setups, uneven distribution via poor `node_selector` policies funnels agents to saturated nodes.

Internally, OpenClaw uses an in-memory `std::atomic<uint32_t>` counter synced via etcd (in clustered mode) or Redis. When `++active_agents > agent_max`, it returns `CLW_ERR_LIMIT_EXCEEDED`.

Core quota logic (pseudocode from `clw-orchestrator/src/agent_pool.cpp`):

```cpp
if (active_agents.load() >= config.agent_max) {
    log_error("clw-agent-limit-exceeded: %u/%u", active_agents, config.agent_max);
    return CLW_ERR_LIMIT_EXCEEDED;
}
```
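The check-and-increment above can be modeled in a few lines of Python. This is a simplified sketch of the quota gate's behavior, not OpenClaw's actual implementation; the `AgentPool` class and its method names are illustrative, and a lock stands in for the atomic counter:

```python
import threading

CLW_ERR_LIMIT_EXCEEDED = 0xE105  # matches the CLW_ERR_QUOTA code in the logs above

class AgentPool:
    """Toy model of the orchestrator's quota gate (illustrative only)."""

    def __init__(self, agent_max):
        self.agent_max = agent_max
        self.active = 0
        self._lock = threading.Lock()  # stands in for std::atomic

    def try_spawn(self):
        # Check and increment must happen as one atomic step; a separate
        # check-then-increment would let concurrent spawns race past the limit.
        with self._lock:
            if self.active >= self.agent_max:
                return CLW_ERR_LIMIT_EXCEEDED
            self.active += 1
            return 0

    def release(self):
        # Called on agent shutdown; skipping this is how zombie agents
        # inflate the count (root cause 3 above).
        with self._lock:
            self.active = max(0, self.active - 1)

pool = AgentPool(agent_max=2)
results = [pool.try_spawn() for _ in range(3)]  # third spawn is rejected
```

The `release()` path is the important detail: any crash that skips it leaks a slot, which is why zombie cleanup alone can resolve the error without raising `agent_max`.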


## 3. Step-by-Step Fix

Fixing `clw-agent-limit-exceeded` requires increasing quotas, optimizing usage, or scaling infrastructure. Follow these steps sequentially.

### Step 1: Inspect Current Limits and Usage
Query status:

```
$ clw status cluster --detail
Cluster: prod-claw
Nodes: 3 (node-1: 40/100 agents, node-2: 35/100, node-3: 26/100)
Total agents: 101/300 (cluster max)
Quota: agent_max=100/node (default)
```


### Step 2: Update Configuration
Edit `clw-config.yaml` (or `/etc/clw/clw-config.yaml` system-wide). **Increase `agent_max`** and enable auto-scaling.

**Before:**

```yaml
orchestrator:
  agent_max: 100  # Per-node limit
  cluster_mode: false  # Standalone, no sharing
pool:
  cleanup_interval: 300s
  node_selector: {}
```

**After:**

```yaml
orchestrator:
  agent_max: 500  # Increased per-node
  cluster_max: 2000  # New cluster-wide cap
  cluster_mode: true  # Enable etcd sync
pool:
  cleanup_interval: 60s  # Faster zombie cleanup
  node_selector:
    resources:
      cpu: ">=2"
      memory: ">=8Gi"
  autoscaling:
    min_agents: 50
    max_agents: 1000
    scale_up_threshold: 80%  # utilization
```

Apply changes:

```
$ clw config reload --live
Configuration reloaded. Active agents: 101/1500 (new limits).
```
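Before reloading, it can help to sanity-check that the new limits are mutually consistent. A small Python sketch follows; the key names mirror the YAML above, but the validator itself is a hypothetical helper, not part of OpenClaw:

```python
def validate_limits(cfg):
    """Sanity-check quota values before reloading config (illustrative)."""
    orch = cfg["orchestrator"]
    auto = cfg.get("pool", {}).get("autoscaling", {})
    errors = []
    # A cluster-wide cap below the per-node cap can never be honored sensibly.
    if orch.get("cluster_mode") and orch["cluster_max"] < orch["agent_max"]:
        errors.append("cluster_max is below the per-node agent_max")
    # Autoscaling should not promise more agents than the cluster allows.
    if auto and auto["max_agents"] > orch.get("cluster_max", auto["max_agents"]):
        errors.append("autoscaling max_agents exceeds cluster_max")
    if auto and auto["min_agents"] > auto["max_agents"]:
        errors.append("autoscaling min_agents exceeds max_agents")
    return errors

cfg = {
    "orchestrator": {"agent_max": 500, "cluster_max": 2000, "cluster_mode": True},
    "pool": {"autoscaling": {"min_agents": 50, "max_agents": 1000}},
}
print(validate_limits(cfg))  # [] when the limits are consistent
```

Catching an inconsistent cap here is cheaper than discovering it after a live reload rejects spawns.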

### Step 3: Clean Up Zombies and Restart

Force-terminate stuck agents:

```
$ clw agent list --status=PENDING,FAILED | xargs -r clw agent stop --force
Stopped 15 zombie agents.
```

Restart the orchestrator:

```
$ sudo systemctl restart clw-orchestrator
# Or with Docker: docker compose restart orchestrator
```
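If the CLI's filtering isn't available in your version, the zombie selection can be done by parsing the status output. A hedged Python sketch, assuming the column layout shown in the Symptoms section (adjust the parsing to your version's actual format):

```python
def zombie_ids(status_lines):
    """Pick agent IDs in PENDING or FAILED state from `clw status agents`-style
    output. Assumes whitespace-separated columns: ID, STATUS, TASKS, NODE."""
    zombies = []
    for line in status_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[1] in ("PENDING", "FAILED"):
            zombies.append(parts[0])
    return zombies

sample = [
    "agent-001  RUNNING  5  node-1",
    "agent-101  PENDING  1  -",
    "agent-102  FAILED   0  node-2",
]
print(zombie_ids(sample))  # ['agent-101', 'agent-102']
```

The resulting IDs can then be fed to `clw agent stop --force` one at a time, which also makes it easy to log exactly which agents were reaped.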

### Step 4: Scale Infrastructure (Kubernetes/Docker)

For Kubernetes, update the Deployment and add a HorizontalPodAutoscaler:

**Before:**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clw-node
spec:
  replicas: 3
```

**After:**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clw-node
spec:
  replicas: 10  # Horizontal scale
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clw-hpa
spec:
  scaleTargetRef:
    kind: Deployment
    name: clw-node
  minReplicas: 5
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Apply the manifests:

```
$ kubectl apply -f clw-scaling.yaml
```

Docker Compose equivalent:

```yaml
services:
  clw-node:
    deploy:
      replicas: 10
```
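Choosing a replica count is simple arithmetic: divide the target agent capacity by the per-node `agent_max`, leaving headroom so utilization stays under the scale-up threshold. A sketch of that sizing calculation (the function and its default headroom are illustrative, not an OpenClaw tool):

```python
import math

def replicas_needed(target_agents, agent_max_per_node, headroom=0.2):
    """Estimate how many node replicas cover a target agent count while
    keeping per-node utilization below (1 - headroom) of agent_max."""
    effective = agent_max_per_node * (1 - headroom)
    return max(1, math.ceil(target_agents / effective))

# With agent_max: 500 from Step 2 and the 2000-agent cluster cap:
print(replicas_needed(2000, 500))  # 5 replicas at 80% effective capacity
```

Here 2000 agents at an effective 400 agents per node yields 5 replicas, which is why `minReplicas: 5` above is a reasonable floor for this configuration.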

### Step 5: Optimize Task Submission

Throttle parallel tasks:

```
# Instead of: clw submit --parallel 1000 tasks.json
$ clw submit --parallel 200 --batch-size 10 tasks.json
```
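The batching idea behind `--batch-size` can be sketched in Python for cases where tasks are generated programmatically; the helper below is illustrative and not an OpenClaw API:

```python
def batches(tasks, batch_size):
    """Split a task list into fixed-size batches so each submission
    stays well under the agent quota."""
    for i in range(0, len(tasks), batch_size):
        yield tasks[i:i + batch_size]

jobs = [f"task-{n}" for n in range(25)]
sizes = [len(b) for b in batches(jobs, 10)]
print(sizes)  # [10, 10, 5]
```

Submitting each batch only after the previous one drains keeps peak concurrent agents near `batch_size` times the per-task agent count, instead of the full task list.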

## 4. Verification

Confirm resolution:

1. Check agent counts:

   ```
   $ clw status agents --summary
   Total active: 95/500 (per-node), 285/2000 (cluster). Healthy.
   ```

2. Stress-test spawning:

   ```
   $ clw bench spawn --count 600 --duration 5m
   Spawned 600 agents: 100% success. Peak: 450 active.
   ```

3. Monitor logs for 10-15 minutes:

   ```
   $ tail -f /var/log/clw/orchestrator.log | grep "limit-exceeded"
   # No matches = fixed.
   ```

4. In Kubernetes, check the HPA:

   ```
   $ kubectl get hpa clw-hpa
   NAME      REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS
   clw-hpa   Deployment/clw-node   45%/70%   5         20        8
   ```

Success metric: zero `clw-agent-limit-exceeded` entries in logs, agent utilization under 80%.
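For automated verification (e.g., in a post-deploy check), the log scan can be expressed as a one-liner; a minimal sketch, assuming the log lines look like the samples in the Symptoms section:

```python
def count_limit_errors(log_lines):
    """Count quota rejections in orchestrator log output; zero after the
    fix is the success criterion."""
    return sum(1 for line in log_lines if "clw-agent-limit-exceeded" in line)

logs = [
    "[2024-10-18 14:32:15] ERROR clw-agent-limit-exceeded: Maximum agents (100) reached.",
    "[2024-10-18 15:01:02] INFO spawn request accepted for agent-102",
]
print(count_limit_errors(logs))  # 1
```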

## 5. Common Pitfalls

- **No Reload**: Changes to `clw-config.yaml` require `clw config reload` or a restart; live reload fails if `cluster_mode: false`.

- **License Check**: Enterprise limits need `clw license validate`:

  ```
  $ clw license validate
  Valid until 2025-01-01. Max agents: 5000.
  ```

- **Etcd Sync Lag**: In clusters, quota propagation takes 5-30s; use `clw cluster sync --force`.

- **Resource Starvation**: Increasing `agent_max` without adding node CPU/RAM causes OOM kills. Monitor with Prometheus:

  ```
  sum(clw_agent_count) > 0.8 * sum(clw_agent_max)
  ```

- **Ignoring Autoscaling**: A static `agent_max` ignores the HPA; enable `autoscaling.enabled: true`.

- **Docker Volume Persistence**: Configs in ephemeral volumes reset on restart; mount them with `volumes: - ./config:/etc/clw`.

- ⚠️ Unverified: On ARM64 macOS, Docker limits may cap at 200 agents regardless of config due to emulation overhead.

For OpenClaw v2.4+, migrate to dynamic quotas via the API:

```
$ curl -X POST http://localhost:8080/api/quotas -d '{"agent_max":1000}'
```
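The same call can be scripted from Python's standard library. The sketch below only builds the request so it can be inspected before sending; the `/api/quotas` endpoint and payload shape are taken from the curl example above, and the helper name is illustrative:

```python
import json
from urllib import request

def quota_update_request(base_url, agent_max):
    """Build (but do not send) the dynamic-quota POST shown above."""
    body = json.dumps({"agent_max": agent_max}).encode()
    return request.Request(
        f"{base_url}/api/quotas",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = quota_update_request("http://localhost:8080", 1000)
print(req.get_method(), req.full_url)  # POST http://localhost:8080/api/quotas
# To actually apply it: request.urlopen(req)
```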

Refer to the OpenClaw documentation for version-specific details.