Fix clw-scheduler-oom: Scheduler Out of Memory Error

1. Symptoms

The clw-scheduler-oom error manifests when the OpenClaw scheduler service exhausts available system memory during workload scheduling operations. This error typically surfaces under high-load conditions where the scheduler must manage numerous concurrent jobs, evaluate complex dependency graphs, or process large volumes of task queue entries.

Primary symptoms include:

The scheduler process terminates abruptly or enters an unresponsive state. When monitoring the scheduler service, you may observe the process exit with code 137 (128 + 9, indicating SIGKILL) or the container being forcibly terminated by the OOM killer. The OpenClaw control plane logs will contain entries similar to the following:

[ERROR] clw-scheduler-oom: Scheduler memory usage exceeded limit
[FATAL] Aborting scheduler: cannot allocate memory for job queue
[WARN] Worker clw://scheduler-01 terminated unexpectedly (exit 137)
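
The exit-code arithmetic mentioned above (137 = 128 + 9) follows the usual shell and container-runtime convention of reporting 128 plus the signal number. A minimal sketch for decoding such codes on Linux; the function name `describe_exit_code` is illustrative and not part of OpenClaw:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a process exit status as reported by a shell or container runtime."""
    if code > 128:
        # By convention, 128 + N means the process was terminated by signal N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


print(describe_exit_code(137))
```

A SIGKILL alone does not prove an OOM kill, so confirm against the kernel log or the container's last termination reason.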

Applications submitting jobs to the scheduler begin experiencing timeout errors or connection failures. The scheduler’s health endpoint returns errors, and the OpenClaw dashboard shows the scheduler as “unhealthy” or “offline.” New job submissions are rejected with errors indicating the scheduler is unavailable.

When you inspect the pod with kubectl or the container runtime, memory consumption climbs to 100% of the configured limit immediately before the failure. On the node, the dmesg kernel log may contain OOM killer messages:

[12345.678] Out of memory: Killed process 1234 (clw-scheduler) total-vm:4096MB anon-rss:4096MB file-rss:0KB
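
To pick such entries out of a longer kernel log, a small parser can help. The sketch below matches only the "Out of memory: Killed process" prefix shown in the sample line; the remaining fields vary between kernel versions, so it deliberately ignores them. The name `parse_oom_kill` is hypothetical.

```python
import re

# Matches the pid and process name; trailing memory fields are not parsed.
OOM_LINE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def parse_oom_kill(line: str):
    """Return (pid, process_name) if the line records an OOM kill, else None."""
    match = OOM_LINE.search(line)
    if match is None:
        return None
    return int(match.group("pid")), match.group("name")


sample = (
    "[12345.678] Out of memory: Killed process 1234 (clw-scheduler) "
    "total-vm:4096MB anon-rss:4096MB file-rss:0KB"
)
print(parse_oom_kill(sample))
```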

2. Root Cause

The clw-scheduler-oom error occurs due to memory exhaustion in the scheduler process. Several contributing factors commonly lead to this condition:

Unbounded Job Queue Growth: The scheduler maintains in-memory structures for pending jobs, scheduled tasks, and dependency information. When the job queue grows faster than the scheduler can process it, memory usage accumulates without release. This commonly occurs when upstream systems flood the scheduler with requests while downstream workers are bottlenecked.

Excessive Dependency Graph Complexity: OpenClaw builds dependency graphs for job scheduling. Complex job hierarchies with deep nesting or wide fan-out patterns consume significant memory. A single job with thousands of dependencies or child tasks can cause the scheduler to allocate substantial memory for graph traversal and state tracking.
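
To check whether a job graph is deep before submitting it, the longest dependency chain can be measured on the client side. This is a minimal sketch, assuming dependencies are available as a plain mapping from job name to the jobs it depends on; it does not reflect how OpenClaw represents graphs internally, and `max_dependency_depth` is a hypothetical helper.

```python
from collections import deque


def max_dependency_depth(deps: dict) -> int:
    """Return the number of jobs on the longest dependency chain.

    `deps` maps each job name to a list of jobs it depends on.
    Raises ValueError if the graph contains a cycle.
    """
    # Collect every job, including ones that appear only as dependencies.
    jobs = set(deps)
    for parents in deps.values():
        jobs.update(parents)

    remaining = {job: len(set(deps.get(job, []))) for job in jobs}
    children = {job: [] for job in jobs}
    for job, parents in deps.items():
        for parent in set(parents):
            children[parent].append(job)

    # Process jobs in topological order, tracking the longest chain ending at each.
    depth = {job: 1 for job in jobs}
    ready = deque(job for job, count in remaining.items() if count == 0)
    visited = 0
    while ready:
        job = ready.popleft()
        visited += 1
        for child in children[job]:
            depth[child] = max(depth[child], depth[job] + 1)
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    if visited != len(jobs):
        raise ValueError("dependency cycle detected")
    return max(depth.values(), default=0)


graph = {"report": ["transform"], "transform": ["extract"], "audit": ["extract"]}
print(max_dependency_depth(graph))
```

The traversal is iterative, so very deep chains do not hit Python's recursion limit.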

Insufficient Memory Limits: Containerized deployments often specify memory limits that do not account for peak load scenarios. The default configuration may be adequate for development environments but insufficient for production workloads with higher concurrency.

Memory Leaks in Scheduler Components: Certain operations within the scheduler may not release allocated memory properly. Repeated scheduling of specific job types or continuous operation over extended periods can cause gradual memory consumption that eventually triggers OOM conditions.

Large Task Payloads: Jobs containing large metadata, extensive environment variables, or embedded configuration data increase per-job memory overhead. When many such jobs queue simultaneously, aggregate memory usage exceeds available capacity.
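
Queue growth and payload size combine in a simple way: memory fills at the net arrival rate times the per-job footprint. The sketch below estimates how long the scheduler can absorb a backlog. All figures are hypothetical; measure your own rates and per-job overhead.

```python
def seconds_until_limit(arrival_rate, dispatch_rate, bytes_per_job, free_bytes):
    """Estimate seconds until queued jobs fill `free_bytes` of memory.

    Rates are in jobs per second. Returns None when the queue is not growing.
    """
    growth = arrival_rate - dispatch_rate  # net jobs added to the queue per second
    if growth <= 0:
        return None
    return free_bytes / (growth * bytes_per_job)


# Hypothetical figures: 500 jobs/s arriving, 300 jobs/s dispatched,
# 8 KiB held per queued job, 400 MiB of headroom under the limit.
eta = seconds_until_limit(500, 300, 8 * 1024, 400 * 1024 * 1024)
print(f"{eta:.0f} s until the limit")  # prints "256 s until the limit"
```

With these numbers the scheduler has just over four minutes of headroom, which is why a stalled worker pool can trigger an OOM quickly.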

3. Step-by-Step Fix

Step 1: Identify Current Memory Configuration

First, examine the current scheduler deployment configuration to understand existing memory limits and actual consumption:

# Check scheduler pod/resource specifications
kubectl get pod -n openclaw -l component=scheduler -o yaml

# View current memory usage patterns
kubectl top pod -n openclaw -l component=scheduler

# Check scheduler logs for memory-related entries
kubectl logs -n openclaw -l component=scheduler --tail=500 | grep -i "memory\|oom\|alloc"

Step 2: Increase Scheduler Memory Allocation

Modify the scheduler deployment to allocate additional memory. Set the container memory limit to the observed peak usage plus a 50% safety margin. The manifests below are excerpts that show only the relevant fields:

Before:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: clw-scheduler
  namespace: openclaw
spec:
  template:
    spec:
      containers:
      - name: scheduler
        image: openclaw/scheduler:latest
        resources:
          limits:
            memory: "512Mi"
          requests:
            memory: "256Mi"

After:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: clw-scheduler
  namespace: openclaw
spec:
  template:
    spec:
      containers:
      - name: scheduler
        image: openclaw/scheduler:latest
        resources:
          limits:
            memory: "2Gi"
          requests:
            memory: "1Gi"
        env:
        - name: CLW_SCHEDULER_QUEUE_LIMIT
          value: "10000"
        - name: CLW_SCHEDULER_BATCH_SIZE
          value: "100"

Apply the updated configuration:

kubectl apply -f scheduler-deployment.yaml

# Verify the scheduler restarts with new limits
kubectl rollout status deployment/clw-scheduler -n openclaw
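
The 50% safety margin from this step can be worked out as follows. This is a sketch under stated assumptions: the peak figure is hypothetical, rounding up to a multiple of 64 Mi is a convenience, and setting the request to half the limit simply mirrors the 1Gi/2Gi ratio in the manifest above.

```python
import math


def suggest_memory_mi(observed_peak_mi, margin=0.5, request_ratio=0.5):
    """Return (limit_mi, request_mi) for a given observed peak in Mi."""
    raw_limit = observed_peak_mi * (1 + margin)
    # Round the limit up to the next multiple of 64 Mi.
    limit_mi = math.ceil(raw_limit / 64) * 64
    request_mi = math.ceil(limit_mi * request_ratio)
    return limit_mi, request_mi


# Hypothetical observed peak of 1360 Mi from `kubectl top pod`.
limit, request = suggest_memory_mi(1360)
print(f"limit: {limit}Mi, request: {request}Mi")  # prints "limit: 2048Mi, request: 1024Mi"
```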

Step 3: Configure Queue Limits

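Set environment variables to prevent unbounded queue growth. These settings control how many jobs the scheduler holds in memory before applying backpressure. The first two repeat the values from the Step 2 manifest, so use this command if you prefer not to edit the manifest: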

# Set queue-related environment variables
kubectl set env deployment/clw-scheduler -n openclaw \
  CLW_SCHEDULER_QUEUE_LIMIT=10000 \
  CLW_SCHEDULER_BATCH_SIZE=100 \
  CLW_SCHEDULER_MAX_DEPENDENCY_DEPTH=500
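
Conceptually, a queue limit turns unbounded growth into explicit rejection. The sketch below illustrates the idea with Python's standard `queue` module. It is not OpenClaw's implementation; `BoundedJobQueue`, `offer`, and `take_batch` are hypothetical names, with `limit` and `batch_size` standing in for CLW_SCHEDULER_QUEUE_LIMIT and CLW_SCHEDULER_BATCH_SIZE.

```python
import queue


class BoundedJobQueue:
    """In-memory job queue that refuses new work once it holds `limit` jobs."""

    def __init__(self, limit: int):
        if limit <= 0:
            # queue.Queue treats maxsize <= 0 as unbounded, which defeats the purpose.
            raise ValueError("limit must be positive")
        self._jobs = queue.Queue(maxsize=limit)

    def offer(self, job) -> bool:
        """Try to enqueue a job; return False instead of growing past the limit."""
        try:
            self._jobs.put_nowait(job)
            return True
        except queue.Full:
            return False

    def take_batch(self, batch_size: int) -> list:
        """Remove and return up to `batch_size` jobs without blocking."""
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(self._jobs.get_nowait())
            except queue.Empty:
                break
        return batch


q = BoundedJobQueue(limit=3)
accepted = [q.offer(n) for n in range(5)]
print(accepted)         # prints [True, True, True, False, False]
print(q.take_batch(2))  # prints [0, 1]
```

Rejected submissions push the backlog back to the caller, where it can be retried or slowed down instead of accumulating in scheduler memory.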

Step 4: Implement Job Pagination for High-Volume Scenarios

For deployments processing large job volumes, paginate submissions: send jobs to the scheduler in fixed-size batches instead of in a single request:

# Example: Paginated job submission using OpenClaw SDK
import time

from openclaw import Client

client = Client("http://clw-scheduler:8080")

def submit_large_job_batch(jobs, batch_size=100):
    """Submit jobs in batches to prevent scheduler memory spikes."""
    for i in range(0, len(jobs), batch_size):
        batch = jobs[i:i + batch_size]
        client.submit_batch(batch)
        # Allow scheduler to process and release memory
        time.sleep(0.1)
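
A fixed pause does not help if the scheduler rejects a batch outright. One option is to retry with exponential backoff. The sketch below takes the submit function as a parameter so it does not depend on any particular SDK call; catching the broad `Exception` is a placeholder, since the exception types raised by the OpenClaw SDK are not documented here.

```python
import time


def submit_with_backoff(submit, batch, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `submit(batch)`, retrying with exponential backoff on failure.

    `submit` is any callable that raises when the scheduler rejects the batch.
    Returns the number of attempts used; re-raises after the final attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            submit(batch)
            return attempt
        except Exception:  # assumption: narrow this to the SDK's rejection error
            if attempt == max_attempts:
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper testable without real delays.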

Step 5: Enable Scheduler Horizontal Pod Autoscaling

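Configure HPA to scale the scheduler based on memory usage. Utilization is measured against the container's memory request, so the request from Step 2 must be set. Additional replicas help only if the scheduler supports running as multiple instances: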

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: clw-scheduler-hpa
  namespace: openclaw
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: clw-scheduler
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70

Apply the HPA configuration:

kubectl apply -f scheduler-hpa.yaml

# Verify HPA status
kubectl get hpa -n openclaw
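
To anticipate how this HPA will react, the core scaling rule from the Kubernetes documentation can be evaluated by hand: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to the configured bounds. The sketch below omits the controller's tolerance band and stabilization window, so treat its results as approximate.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization=70,
                     min_replicas=2, max_replicas=10):
    """Approximate the HPA replica count for average utilization in percent."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp to the minReplicas/maxReplicas bounds from the HPA spec.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(2, 90))   # prints 3
print(desired_replicas(2, 35))   # prints 2 (clamped to minReplicas)
print(desired_replicas(4, 400))  # prints 10 (clamped to maxReplicas)
```

With a 1Gi request, the 70% target corresponds to roughly 717Mi of average usage per pod.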

4. Verification

After implementing the fixes, verify the scheduler operates correctly under load:

# Check scheduler is running and healthy
kubectl get pods -n openclaw -l component=scheduler
kubectl describe pod -n openclaw -l component=scheduler | grep -A5 "Conditions"

# Monitor memory usage over time
kubectl top pod -n openclaw -l component=scheduler --sort-by=memory

# View scheduler logs for successful operation
kubectl logs -n openclaw -l component=scheduler --tail=100 | grep -E "started|ready|scheduling"

Load test verification:

Submit a test workload to validate scheduler behavior under stress:

# Create a test job submission script
cat > /tmp/test-load.yaml << 'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: scheduler-load-test
  namespace: openclaw
spec:
  parallelism: 100
  template:
    spec:
      containers:
      - name: loader
        image: openclaw/load-tester:latest
        args: ["--jobs", "1000", "--target", "http://clw-scheduler:8080"]
      restartPolicy: Never
  backoffLimit: 2
EOF

kubectl apply -f /tmp/test-load.yaml

# Watch scheduler memory during load test
watch -n 5 'kubectl top pod -n openclaw -l component=scheduler'
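
To record readings during the load test instead of watching them scroll by, the text output of `kubectl top pod` can be parsed. This sketch assumes the usual three-column layout (name, CPU, memory) and memory values in Ki, Mi, or Gi; the pod names in the sample are hypothetical.

```python
def parse_top_memory_mi(top_output: str) -> dict:
    """Map pod name to memory usage in Mi from `kubectl top pod` text output."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    usage = {}
    for line in top_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        name, memory = fields[0], fields[2]
        suffix = memory[-2:]
        if suffix in units:
            usage[name] = float(memory[:-2]) * units[suffix]
    return usage


sample_output = """NAME                 CPU(cores)   MEMORY(bytes)
clw-scheduler-7d9f   250m         812Mi
clw-scheduler-x2k4   310m         1Gi
"""
readings = parse_top_memory_mi(sample_output)
limit_mi = 2048  # the 2Gi limit from Step 2
for pod, used in readings.items():
    print(f"{pod}: {used / limit_mi:.0%} of limit")
```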

Verify the scheduler remains responsive after load test completion:

# Test scheduler health endpoint
kubectl exec -n openclaw deployment/clw-scheduler -- curl -s http://localhost:8080/health

# Check scheduler can still accept jobs
kubectl exec -n openclaw deployment/clw-scheduler -- curl -s -X POST \
  http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{"name":"health-check","spec":{"image":"alpine","command":["echo","test"]}}'

5. Common Pitfalls

Setting memory limits too low: When initially configuring the scheduler, it is tempting to use conservative memory limits to conserve resources. However, undersized limits cause immediate OOM failures under any meaningful load. Always benchmark actual memory requirements before imposing strict limits.

Ignoring queue backpressure: Without proper queue limits, the scheduler accepts jobs faster than it can process them. This causes memory to accumulate as the in-memory queue grows unbounded. Always configure CLW_SCHEDULER_QUEUE_LIMIT in production deployments.

Single scheduler bottleneck: A single scheduler instance becomes a single point of failure and a throughput bottleneck. High-volume deployments require multiple scheduler replicas with load balancing.

Not monitoring over time: Memory leaks often manifest gradually. Regular monitoring over days or weeks reveals trends that short-term observation misses. Set up alerts for sustained high memory usage rather than only critical OOM conditions.

Neglecting worker capacity: The scheduler queues jobs for workers to execute. If workers are under-provisioned, jobs accumulate in the scheduler queue, consuming memory. Ensure worker pool size matches submission rate.

Insufficient swap configuration: While swap is not a substitute for proper memory allocation, having swap available can prevent immediate process termination when memory spikes occur briefly. Kubernetes nodes typically run with swap disabled, so this applies mainly to bare-metal or VM deployments. Treat swap as a safety net, not a primary memory source.
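
For the monitoring pitfall above, gradual growth is easier to spot as a trend than as individual readings. A minimal sketch, assuming you collect (hours elapsed, memory in Mi) samples from your monitoring system; the sample values are hypothetical.

```python
def memory_growth_per_hour(samples):
    """Least-squares slope of (hours, memory_mi) samples, in Mi per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one point in time")
    return covariance / variance


# Hypothetical readings taken every six hours.
samples = [(0, 600), (6, 630), (12, 660), (18, 690)]
slope = memory_growth_per_hour(samples)
hours_left = (2048 - samples[-1][1]) / slope  # headroom under a 2Gi limit
print(f"{slope:.1f} Mi/hour, about {hours_left:.0f} hours until the limit")
```

A steady positive slope under constant load suggests a leak; a slope that tracks queue depth points to backlog instead.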

6. Related Errors

clw-worker-oom: Worker pods experience the same out-of-memory condition when executing job payloads. Workers require appropriately sized memory limits for the tasks they run. See the dedicated troubleshooting guide for worker OOM resolution.

clw-queue-full: This error indicates the job queue has reached its configured maximum capacity. Unlike OOM, the scheduler remains operational but actively rejects new submissions. It often precedes OOM if queue limits are set higher than memory allows.

clw-timeout: Job execution timeouts may occur when the scheduler becomes unresponsive due to memory pressure. The scheduler cannot dispatch jobs to workers, causing pending jobs to exceed their time limits.

clw-resource-exhausted: A general resource exhaustion error that encompasses memory, CPU, and other resource constraints. The scheduler may report this broader error when OOM conditions are detected.