Fix clw-scheduler-oom: OpenClaw scheduler out of memory during task scheduling



1. Symptoms

The clw-scheduler-oom error manifests when OpenClaw’s central scheduler component fails to allocate memory for incoming tasks. This typically halts task dispatching across the cluster, leading to cascading failures in distributed workloads.

Common symptoms include:

  • Scheduler logs flood with OOM events.
  • clw status reports scheduler_heap_usage: 100%.
  • Worker nodes sit idle despite pending jobs in the queue.
  • The metrics endpoint (/metrics) shows clw_scheduler_tasks_pending > 10000 and clw_scheduler_memory_rss above the configured limit.
  • Cluster-wide latency spikes as tasks back up.

A typical log sequence:

[2024-10-18 14:32:15] ERROR clw-scheduler: clw-scheduler-oom: Heap exhausted (requested 4KiB for task slot, available 0B). Total heap: 2GiB/2GiB used.
[2024-10-18 14:32:15] WARN  clw-scheduler: Dropping 127 pending tasks due to OOM. Task IDs: [task-abc123, task-def456, ...]
[2024-10-18 14:32:16] FATAL clw-coordinator: Scheduler unresponsive. Cluster health: DEGRADED.

In Kubernetes deployments, pods may enter the OOMKilled state if the scheduler shares node resources with other workloads:

kubectl logs clw-scheduler-xyz | grep oom
clw-scheduler-oom: Evicting pod due to memory pressure.

Reproduction is straightforward under load: submit 10k+ small tasks rapidly via clw submit --parallel 1000.

2. Root Cause

OpenClaw’s scheduler (clw-scheduler) maintains an in-memory task queue and metadata store. Each task consumes ~4-16KiB for slot allocation (ID, dependencies, state, retries). clw-scheduler-oom triggers when heap exhaustion prevents new allocations.

Primary causes:

  1. Over-subscription: High task ingress rate exceeds drain rate to workers. Default heap (2GiB) handles ~500k tasks; bursts overwhelm it.
  2. Memory Leaks: Unreleased task metadata from failed/canceled tasks. Seen in OpenClaw v1.2.x due to incomplete GC in dependency graphs.
  3. Configuration Mismatch: scheduler_heap_size unset or too low for workload. Defaults to 2GiB, ignores container limits.
  4. Large Payloads: Tasks with oversized inputs (e.g., >1MiB serialized protobufs) inflate per-slot usage.
  5. Cluster Scale: Multi-tenant setups where one namespace floods the shared scheduler.
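The "~500k tasks" figure in cause 1 follows directly from the per-slot cost: a quick capacity check, sketched here with the sizes the section cites.

```go
package main

import "fmt"

// slotCapacity estimates how many task slots fit in a heap of
// heapBytes when each slot consumes slotBytes (the section cites
// ~4-16KiB per task for ID, dependencies, state and retries).
func slotCapacity(heapBytes, slotBytes int64) int64 {
	return heapBytes / slotBytes
}

func main() {
	const GiB = 1 << 30
	const KiB = 1 << 10
	// Default 2GiB heap at the 4KiB best case: ~524k slots,
	// matching the "~500k tasks" estimate before bursts overwhelm it.
	fmt.Println(slotCapacity(2*GiB, 4*KiB)) // 524288
	// At the 16KiB worst case, capacity drops to ~131k slots.
	fmt.Println(slotCapacity(2*GiB, 16*KiB)) // 131072
}
```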

Heap profiling reveals:

clw profile heap --dump=scheduler-heap.pb
Top allocators:
- task_slot_metadata: 1.8GiB (450k objects)
- dependency_graph: 120MiB (leaked refs)
- retry_queue: 80MiB

This is not a kernel OOM kill: the exhaustion occurs in the user-space heap managed by jemalloc/tcmalloc.

3. Step-by-Step Fix

Resolve the error by tuning configuration, optimizing workload submission, and patching known leaks. Restart the scheduler after applying each change.

Step 1: Increase Scheduler Heap Limit

Edit clw-config.yaml or Helm values.

Before:

scheduler:
  heap_size: 2GiB  # Default, insufficient for >500k tasks
  gc_interval: 30s

After:

scheduler:
  heap_size: 8GiB  # Scale to workload; monitor RSS
  gc_interval: 10s  # Aggressive GC
  max_pending_tasks: 100000  # Throttle ingress

Apply: kubectl apply -f clw-config.yaml or clw config reload.

Step 2: Enable Task Batching and Fanout Limits

Reduce per-task overhead by batching submissions.

Before: (CLI spawning individual tasks)

for i in {1..10000}; do
  clw submit --image alpine --cmd "sleep 1" --parallel 1 &
done

After: (Batched submission)

clw submit-batch --image alpine --cmd "sleep 1" --count 10000 --fanout 100 --batch-size 100

In application code (OpenClaw Go SDK):

Before:

for i := 0; i < 10000; i++ {
    task := &clw.Task{ID: fmt.Sprintf("task-%d", i), Cmd: []string{"sleep", "1"}}
    scheduler.Submit(task)  // Per-task alloc
}

After:

tasks := make([]*clw.Task, 10000)
for i := 0; i < 10000; i++ {
    tasks[i] = &clw.Task{ID: fmt.Sprintf("task-%d", i), Cmd: []string{"sleep", "1"}}
}
batch := clw.NewBatch(tasks, clw.BatchOpts{Fanout: 100})
scheduler.SubmitBatch(batch)  // Shared metadata
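When the full task list will not fit in one batch, split it to respect the batch-size cap (100 in the CLI example above). A generic chunking helper, sketched below; `chunk` is an illustrative name, not part of the clw SDK.

```go
package main

import "fmt"

// chunk splits ids into batches of at most size elements, so each
// batch submission stays within the configured batch-size cap.
func chunk(ids []string, size int) [][]string {
	var out [][]string
	for len(ids) > size {
		out = append(out, ids[:size])
		ids = ids[size:]
	}
	if len(ids) > 0 {
		out = append(out, ids) // final, possibly short, batch
	}
	return out
}

func main() {
	ids := make([]string, 10000)
	for i := range ids {
		ids[i] = fmt.Sprintf("task-%d", i)
	}
	batches := chunk(ids, 100)
	fmt.Println(len(batches), len(batches[0])) // 100 100
}
```

Each batch can then be wrapped and submitted in turn, so the scheduler holds shared metadata for at most one batch's worth of new slots at a time.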

Step 3: Patch Memory Leaks (v1.2.x+)

Upgrade to v1.3.0+ or apply hotfix:

# Docker pull latest
docker pull openclaw/scheduler:v1.3.0
kubectl rollout restart deployment/clw-scheduler

Workaround without upgrading: set dependency_gc: true and retry_dedup: true in the scheduler config.

Step 4: Resource Limits in Kubernetes

Before:

resources:
  limits:
    memory: "2Gi"  # Matches default heap, no headroom

After:

resources:
  limits:
    memory: "12Gi"  # Heap + overhead
  requests:
    memory: "8Gi"
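The "heap + overhead" sizing above can be sanity-checked before rollout. A minimal sketch: the 1.5x overhead factor (allocator metadata, stacks, buffers beyond the heap) is an assumption for illustration, not an OpenClaw-documented ratio.

```go
package main

import "fmt"

// fitsLimit reports whether heapGiB plus an overhead estimate stays
// within the pod memory limit. overheadFactor of 1.5x is an assumed
// rule of thumb, not an OpenClaw-specified value.
func fitsLimit(heapGiB, limitGiB float64) bool {
	const overheadFactor = 1.5
	return heapGiB*overheadFactor <= limitGiB
}

func main() {
	fmt.Println(fitsLimit(8, 12)) // true: the "After" config above
	fmt.Println(fitsLimit(2, 2))  // false: default heap, no headroom
}
```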

Step 5: Monitor and Alert

Add Prometheus rules:

groups:
- name: clw_scheduler
  rules:
  - alert: SchedulerOOM
    expr: clw_scheduler_heap_usage > 0.9
    for: 2m

⚠️ Unverified: For extreme scale (>1M tasks), shard scheduler with clw-scheduler-shard: 4.

4. Verification

  1. Restart the scheduler and wait for rollout: kubectl rollout restart deployment/clw-scheduler && kubectl rollout status deployment/clw-scheduler.
  2. Load test: clw bench --tasks 500k --rate 10k/s.
  3. Check logs: No clw-scheduler-oom.
  4. Metrics:
    curl http://clw-scheduler:8080/metrics | grep clw_scheduler_heap_usage
    clw_scheduler_heap_usage 0.45
    
  5. Heap dump: clw profile heap --verify.
  6. Steady-state: clw status shows scheduler_heap_usage < 80%, tasks draining.

Success: 500k tasks complete without drops.

5. Common Pitfalls

  • Ignoring Container Limits: Setting heap_size above the pod memory limit causes silent truncation.
  • No GC Tuning: The default 30s interval lags under bursts; set it to 5-10s.
  • Overlooking Workers: Fixing the scheduler may only move the OOM to the workers (clw-worker-oom).
  • Batch Misconfig: fanout above ~1000 reintroduces per-subtree allocations, defeating batching.
  • Version Skew: v1.2.x leaks persist regardless of config changes; always verify with clw version --check.
  • Profiling Overhead: clw profile heap adds roughly 10% CPU overhead; use sampling in production.
  • Shared Clusters: Missing namespace quotas let one tenant flood the scheduler; enforce clw namespace limit tasks 10k.
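The max_pending_tasks throttle from Step 1 works best when clients also apply backpressure instead of blindly retrying. A minimal sketch of a client-side submission gate; the type and field names are illustrative, not clw SDK APIs.

```go
package main

import "fmt"

// gate models client-side backpressure against a pending-task cap:
// callers stop submitting once the tracked pending count reaches
// the cap, instead of letting the scheduler drop tasks at OOM time.
type gate struct {
	pending, max int
}

// trySubmit admits a task if there is capacity; the caller should
// back off and retry once the queue drains.
func (g *gate) trySubmit() bool {
	if g.pending >= g.max {
		return false
	}
	g.pending++
	return true
}

// drain records n tasks having been dispatched to workers.
func (g *gate) drain(n int) {
	if n > g.pending {
		n = g.pending
	}
	g.pending -= n
}

func main() {
	g := &gate{max: 3}
	accepted := 0
	for i := 0; i < 5; i++ {
		if g.trySubmit() {
			accepted++
		}
	}
	fmt.Println(accepted) // 3
}
```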

Related error codes often seen alongside clw-scheduler-oom under high load:

| Error Code | Description | Fix Summary |
| --- | --- | --- |
| clw-task-alloc-fail | Worker-side task slot failure | Increase worker slots |
| clw-heap-limit-exceeded | Configured heap cap hit | Tune max_heap |
| clw-worker-oom | Worker heap exhaustion | Vertically scale workers |

