Fix clw-gpu-limit-exceeded: OpenClaw GPU resource limit exceeded

Tags: OpenCL, Intermediate, Linux, Windows, macOS, NVIDIA CUDA, AMD ROCm


1. Symptoms

The clw-gpu-limit-exceeded error in OpenClaw manifests during kernel execution or resource allocation on GPU devices. OpenClaw, a high-level C++ wrapper for OpenCL, throws this error when the GPU’s hardware limits are surpassed, such as total global memory, local memory per compute unit, or maximum concurrent work-items.

Typical symptoms include:

Error: clw-gpu-limit-exceeded
GPU device 'NVIDIA GeForce RTX 3080' exceeded resource limits.
Available global memory: 10 GB
Requested: 12 GB for buffers + kernels
Aborting kernel enqueue on queue 0x7f8b2c001230
CLW_ERROR_GPU_LIMIT_EXCEEDED: clEnqueueNDRangeKernel failed with CL_OUT_OF_RESOURCES (-5)
Context: queue.submit(kernel, global_work_size=[1024,1024], local_work_size=[16,16])

Programs may hang indefinitely, crash with segmentation faults, or fall back to CPU execution if multi-device support is enabled. GPU utilization spikes to 100% before the failure, visible via nvidia-smi or rocm-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        Off | 00000000:01:00.0 Off |                  N/A |
| 30%   75C    P2              285W / 320W |  10240MiB / 10240MiB |     99%      Default |
+-----------------------------------------+----------------------+----------------------+

Logs often precede the error with warnings such as “CL_MEM_OBJECT_ALLOCATION_FAILURE” or notices that the work-group size was reduced. The error affects compute-intensive applications such as image processing, ML inference, and simulations.

2. Root Cause

OpenClaw abstracts OpenCL’s cl_int error codes, mapping low-level issues like CL_OUT_OF_RESOURCES (-5), CL_INVALID_BUFFER_SIZE (-61), or CL_INVALID_WORK_GROUP_SIZE (-54) to clw-gpu-limit-exceeded. Root causes stem from GPU hardware constraints:

  • Memory Exhaustion: Buffers, images, or pipes exceed device global memory (e.g., 8-24 GB on consumer GPUs). Includes kernel code, constants, and runtime allocations.
  • Local Memory Limits: Per-work-group local memory > 48 KB (NVIDIA) or 64 KB (AMD).
  • Work-Group/Compute Unit Overload: Global work size exceeds max compute units × max work-group size (e.g., 1024 work-items/group).
  • Concurrent Kernels: Too many enqueued kernels without synchronization, saturating command queues.
  • Fragmented Memory: Repeated alloc/free cycles cause fragmentation, reducing usable contiguous space.

Query device info to confirm:

cl_device_id device = clw::Device::get(0).get();
cl_ulong global_mem = 0;
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(cl_ulong), &global_mem, NULL);
printf("Global mem: %llu bytes\n", (unsigned long long)global_mem);  // e.g., 10737418240 (10 GB)

Unoptimized kernels with large private variables or unrolled loops amplify usage. Host-side leaks (unreleased clw::Buffer) propagate to GPU via pinning. Vendor-specific: NVIDIA enforces stricter L2 cache limits; AMD hits ROCm scheduler caps.

3. Step-by-Step Fix

Fix by profiling, reducing footprint, and optimizing. Use OpenClaw’s clw::Platform, clw::Device, and clw::Context.

Step 1: Query Device Limits

Fetch capabilities before allocation.

#include <openclaw/clw.h>
clw::Platform platform = clw::Platform::get(0);
clw::Device gpu = platform.getDevice(clw::DeviceType::GPU, 0);
clw::Context context(gpu);

cl_ulong max_alloc = gpu.getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>();
cl_ulong global_mem = gpu.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>();
size_t max_wg_size = gpu.getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>();  // returns size_t, not cl_uint
printf("Max alloc: %llu, Global: %llu, Max WG: %zu\n",
       (unsigned long long)max_alloc, (unsigned long long)global_mem, max_wg_size);

Step 2: Reduce Buffer Sizes

Chunk large buffers.

Before:

clw::Buffer large_buf(context, CL_MEM_READ_WRITE, 12ULL * 1024 * 1024 * 1024);  // 12 GB -> exceeds 10 GB GPU
clw::Program program(context, "kernel.cl");
clw::Kernel kernel(program, "compute");
clw::CommandQueue queue(context, gpu);
queue.enqueueNDRange(kernel, null_range, {4096, 4096}, {16, 16});

After:

const size_t chunk_size = 2ULL * 1024 * 1024 * 1024;  // 2 GB chunks
const int num_chunks = 6;                              // 6 x 2 GB = 12 GB of data total
clw::Buffer chunk(context, CL_MEM_READ_WRITE, chunk_size);  // one resident buffer, reused

for (int i = 0; i < num_chunks; ++i) {
    // Upload the i-th 2 GB slice of host data into the reusable buffer,
    // then process it; only one chunk is ever resident on the device.
    kernel.setArg(0, chunk);
    queue.enqueueNDRange(kernel, null_range, {2048, 2048}, {16, 16});  // halved work size
    queue.finish();
}

Step 3: Optimize Kernel Work-Groups

Lower local size, use shared memory efficiently.

Before:

__kernel void compute(__global float* data, int n) {
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    __local float shared[16384];  // 64 KB per work-group -> exceeds NVIDIA's 48 KB limit
    shared[lid] = data[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    // ...
}

After:

__kernel void compute(__global float* data, int n) {
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int wg_size = get_local_size(0);  // Dynamic
    __local float shared[128];  // 512 bytes, wg_size=128 -> <48 KB
    shared[lid] = data[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    if (lid < 64) shared[lid] += shared[lid + 64];  // Manual reduction
    // ...
}

Enqueue with {4096}, {128}.

Step 4: Release Resources Explicitly

queue.finish();     // drain outstanding work before teardown
// clw::Buffer and clw::Context release their OpenCL objects when they go
// out of scope; call release() only for early, explicit teardown.
context.release();

Step 5: Enable Profiling

clw::CommandQueue prof_queue(context, gpu, CL_QUEUE_PROFILING_ENABLE);
prof_queue.enqueueNDRange(...);   // enqueue on the profiling-enabled queue
prof_queue.finish();

4. Verification

  • Rerun: No clw-gpu-limit-exceeded in logs.
  • Monitor GPU:
watch -n 1 nvidia-smi  # Memory <90%
  • Query events:
clw::Event event = queue.enqueueNDRange(...);
event.wait();
cl_ulong time_ns = event.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
                   event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
printf("Kernel time: %llu ns\n", (unsigned long long)time_ns);  // Benchmark speedup
  • Stress test: Scale work size up 20%, confirm stability.
  • Tools: clinfo for limits, NVProf/ROCm Profiler for peak memory.

Success: Kernel executes, GPU mem <80% peak.

5. Common Pitfalls

  • Ignoring Vendor Differences: NVIDIA Volta+ limits local mem to 48 KB; AMD to 64 KB. Query CL_DEVICE_LOCAL_MEM_SIZE.
  • Host Memory Confusion: clw-out-of-host-memory masks GPU issues; use CL_MEM_USE_HOST_PTR sparingly.
  • No Synchronization: Repeated queue.enqueueNDRange calls without finish() or events let commands pile up in the queue. A typical log:
Multiple kernels enqueued without barriers -> CL_OUT_OF_RESOURCES
  • Large Constant Buffers: __constant > 64 KB fails silently.
  • Private Vars Bloat: float priv[1000] per work-item × 1M items = GBs.
  • Over-Reliance on Defaults: Hardcoding work-group sizes is fragile; always query CL_KERNEL_WORK_GROUP_SIZE for the kernel on the target device.
  • Leak Cycles: In loops, recreate buffers without release → fragmentation.

⚠️ Unverified on Apple Silicon M1/M2 (Metal backend quirks).

6. Related Errors

  • clw-out-of-host-memory: Host RAM exhaustion before GPU transfer. Fix: reserve host memory up front rather than repeatedly resizing std::vector.
  • clw-invalid-work-group-size: Work-group dims mismatch device max. Query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
  • clw-context-destruction-failed: Unfinished queues block teardown.
+-------------------------+--------------------+-------------+
| Error                   | Similarity         | Fix Diff    |
+-------------------------+--------------------+-------------+
| clw-out-of-host-memory  | 80% (memory chain) | Host paging |
| clw-invalid-buffer-size | 60% (alloc phase)  | Size checks |
| clw-invalid-context     | 40% (setup)        | Device enum |
+-------------------------+--------------------+-------------+

Cross-reference these errors when debugging multi-device setups.