Fix clw-gpu-crash: GPU Memory Segmentation Fault in OpenClaw Compute Workloads

1. Symptoms

The clw-gpu-crash error occurs when an OpenClaw compute workload encounters a critical failure at the GPU level. This manifests as an abrupt termination of the GPU computation process, often leaving the device in an undefined state.

Observable Symptoms

The most common symptoms reported by developers include:

Sudden process termination: The OpenClaw worker process exits with a non-zero exit code immediately after launching GPU kernels.
Device becomes unresponsive: After the crash, subsequent GPU operations return CL_DEVICE_NOT_AVAILABLE or similar errors until the device is reset.
dmesg kernel errors: On Linux systems, the kernel ring buffer may contain entries indicating GPU memory access violations:

[  123.456789] NVRM: Xid (PCI:0000:01:00): GPU Crash, reason: GF100
[  123.456890] NVRM: Xid (PCI:0000:01:00): GPU memory access violation at address 0x12345678
[  123.456891] NVRM: Xid (PCI:0000:01:00):   - GPU 0000:01:00.0: GPU has fallen off the bus

Error log output: The OpenClaw runtime emits the following error message:

[ERROR] OpenClaw Worker: clw-gpu-crash detected
[ERROR]   Device: NVIDIA Tesla T4 (ID: 0)
[ERROR]   Workload: matrix_multiply_v2.clw
[ERROR]   Crash type: GPU_MEMORY_SEGFAULT
[ERROR]   Context dump saved to: /var/log/openclaw/crash_20241230_143255.dmp

Partial results: In some cases, the GPU may have completed a portion of the workload before crashing, leaving partial output in device memory.
Timeout behavior: If Watchdog timers are enabled, the system may report a kernel execution timeout before the crash is officially detected.

Secondary Symptoms

After a clw-gpu-crash, you may observe:

Memory leaks in the host application due to improperly released OpenCL/CUDA resources.
Other GPU workloads on the same device fail to initialize.
NVIDIA-smi may show the GPU in “Error” state until reset:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05   Driver Version: 535.154.05   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute-M. |
|===============================+======================+======================|
|   0  Tesla T4          Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   E    35W /  70W |       0MiB / 15360MiB |      0%      Default      |
+-------------------------------+----------------------+----------------------+
|   0  GPU HAS FALLEN OFF THE BUS. GPU recovery required.                     |
+-------------------------------+----------------------+----------------------+

2. Root Cause

The clw-gpu-crash error originates from critical memory access violations at the GPU hardware level. Understanding the root causes is essential for implementing effective fixes.

Primary Root Causes

1. Out-of-Bounds Global Memory Access

The most frequent cause is a kernel accessing global device memory outside the bounds of allocated buffers. This occurs when:

Array indices are calculated incorrectly, leading to negative indexing or indices exceeding allocated size.
Pointer arithmetic walks past the end of a buffer.
Misaligned reads/writes that span memory pages without proper padding.

// Example problematic kernel code
__kernel void process_pixels(__global uchar* image, int width, int height) {
    int idx = get_global_id(0);
    intidy = get_global_id(1);
    // Bug: Not checking bounds before accessing pixel
    int pixel_index = idy * width * 3 + idx * 3;  // May overflow on boundary
    image[pixel_index] = 255;  // Undefined behavior at edges
}

2. Shared Memory Bank Conflicts

Excessive shared memory bank conflicts can cause stalls that, when combined with other memory access issues, trigger a crash during memory operations.

3. Invalid Pointer or Null Pointer Dereference

Passing null device pointers to kernels or using pointers to freed/alienated memory:

__kernel void compute(__global float* input, __global float* output) {
    // If 'input' is NULL, this causes a segfault
    output[get_global_id(0)] = input[get_global_id(0)] * 2.0f;
}

4. Stack Overflow in Kernel

Excessive local array declarations within kernel functions that exceed the per-thread local memory allocation:

__kernel void bad_kernel(__global float* data) {
    // Too large for stack - each thread allocates 16KB on stack
    float local_buffer[4096];  // 4096 * 4 bytes = 16KB per thread
    // ...
}

5. Race Conditions in Cooperative Groups

Incorrect synchronization between threads in cooperative groups can lead to shared memory corruption:

__kernel void race_condition_kernel(__global atomic_int* counter) {
    __local int temp;
    temp = counter[get_global_id(0)];
    // Missing barrier before shared access
    // Other threads may modify temp simultaneously
    barrier(CLK_LOCAL_MEM_FENCE);  // Barrier placed too late
}

6. Driver/PHY Level Instability

Occasionally, the GPU hardware or driver encounters an uncorrectable error:

ECC memory errors (on supported GPUs)
PCIe bus errors
Thermal throttling leading to unreliable operation
Power supply irregularities

Underlying Technical Mechanism

When the GPU’s memory management unit (MMU) detects an illegal memory access, it raises an exception. The NVIDIA driver (or AMD ROCm stack) catches this exception, attempts to isolate the faulting context, and signals the error to the host application through the OpenCL or CUDA runtime API. OpenClaw intercepts this signal and translates it into the clw-gpu-crash error code.

The crash dump file (*.dmp) contains:

Register state at the time of crash
Memory access address that triggered the fault
Call stack within the kernel
GPU state including current kernel and workgroup information

3. Step-by-Step Fix

Resolving clw-gpu-crash requires systematic debugging and code remediation. Follow these steps in order:

Step 1: Enable Detailed GPU Logging

Before attempting fixes, enable maximum verbosity in OpenClaw to capture crash details:

Before:

# openclaw.yaml
logging:
  level: info

After:

# openclaw.yaml
logging:
  level: debug
  gpu_debug: true
  crash_dump: true
  crash_dump_path: /var/log/openclaw/crashes

Step 2: Parse the Crash Dump

Locate the crash dump file referenced in the error log and examine it:

# Identify crash dump location from error message
cat /var/log/openclaw/crash_20241230_143255.dmp

# Use OpenClaw's crash analyzer tool
openclaw-crash-analyze /var/log/openclaw/crashes/latest.dmp

The analyzer outputs the faulting memory address and instruction pointer:

CRASH ANALYSIS REPORT
=====================
Fault Type:        GLOBAL_MEMORY_ACCESS_VIOLATION
Fault Address:     0xFFFF000012345678
Instruction Ptr:   0x1008C (kernel: matrix_multiply_v2.clw:147)
Register State:
  PC: 0x1008C
  ADDR: 0xFFFF000012345678
  WGRP_ID: (23, 5, 0)
  THREAD_ID: 128

Step 3: Add Bounds Checking to Kernel Code

Before:

__kernel void matrix_multiply(
    __global float* A,
    __global float* B,
    __global float* C,
    int M, int N, int K) {
    
    int row = get_global_id(0);
    int col = get_global_id(1);
    
    // No bounds checking - dangerous
    float sum = 0.0f;
    for (int i = 0; i < K; i++) {
        sum += A[row * K + i] * B[i * N + col];
    }
    C[row * N + col] = sum;
}

After:

__kernel void matrix_multiply(
    __global float* A,
    __global float* B,
    __global float* C,
    int M, int N, int K) {
    
    int row = get_global_id(0);
    int col = get_global_id(1);
    
    // Bounds check before any memory access
    if (row >= M || col >= N) {
        return;  // Guard clause - exit safely
    }
    
    float sum = 0.0f;
    for (int i = 0; i < K; i++) {
        sum += A[row * K + i] * B[i * N + col];
    }
    C[row * N + col] = sum;
}

Step 4: Implement Proper Pointer Validation in Host Code

Before:

import pyopenclaw

def launch_workload():
    # No validation before kernel launch
    kernel = pyopenclaw.Kernel("compute")
    buffer = pyopenclaw.Buffer(size=1024)
    kernel.set_arg(0, buffer)
    kernel.launch(global_size, local_size)

After:

import pyopenclaw

def launch_workload():
    kernel = pyopenclaw.Kernel("compute")
    
    # Explicit validation
    if not kernel.is_valid():
        raise RuntimeError("Kernel binary is invalid")
    
    # Validate buffer before setting
    buffer_size = 1024
    buffer = pyopenclaw.Buffer(size=buffer_size)
    
    if buffer.size() < buffer_size:
        raise RuntimeError(f"Insufficient buffer allocated: {buffer.size()} < {buffer_size}")
    
    kernel.set_arg(0, buffer)
    
    # Check for NULL before launch
    if buffer.address() == 0:
        raise RuntimeError("Buffer pointer is NULL - memory allocation failed")
    
    kernel.launch(global_size, local_size)

Step 5: Fix Shared Memory Allocation Issues

Before:

__kernel void image_process(__global uchar* input, __global uchar* output) {
    // Stack allocation - causes overflow on many GPUs
    uchar local_scratch[8192];
    float intermediate[2048];
    
    for (int i = 0; i < 8192; i++) {
        local_scratch[i] = input[get_global_id(0) * 8192 + i];
    }
    // Processing...
}

After:

__kernel void image_process(
    __global uchar* input, 
    __global uchar* output,
    __local float* shared_scratch) {  // Use shared memory parameter
    
    // Shared memory is pre-allocated by host
    // Host code: kernel.set_local_arg(2, clbuild.localMemorySize(8192 * sizeof(float)))
    
    float intermediate[256];  // Reduced - fits in registers/stack
    
    for (int i = 0; i < get_local_size(0); i++) {
        // Bounds-checked access using local ID
        int local_idx = get_local_id(0);
        int global_idx = get_global_id(0) * get_local_size(0) + local_idx;
        if (global_idx < input_size) {
            shared_scratch[local_idx] = (float)input[global_idx];
        }
    }
    
    barrier(CLK_LOCAL_MEM_FENCE);
    // Processing with shared memory...
}

Step 6: Add Synchronization Barriers

Before:

__kernel void parallel_reduce(__global float* data, __local float* partial) {
    int tid = get_local_id(0);
    int gid = get_global_id(0);
    
    // Load into shared memory
    partial[tid] = data[gid];
    // Missing barrier - other threads may not have finished loading
    
    // Incorrect reduction - race condition
    for (int s = 1; s < get_local_size(0); s *= 2) {
        if (tid % (2 * s) == 0) {
            partial[tid] += partial[tid + s];  // Race condition!
        }
    }
}

After:

__kernel void parallel_reduce(__global float* data, __local float* partial) {
    int tid = get_local_id(0);
    int gid = get_global_id(0);
    unsigned int local_size = get_local_size(0);
    
    // Phase 1: Load into shared memory
    partial[tid] = data[gid];
    
    // CRITICAL: Barrier to ensure all threads complete their loads
    barrier(CLK_LOCAL_MEM_FENCE);
    
    // Phase 2: Reduction in shared memory
    for (int s = local_size / 2; s > 0; s >>= 1) {
        if (tid < s) {
            partial[tid] += partial[tid + s];
        }
        // CRITICAL: Barrier at end of each reduction step
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

Step 7: Verify GPU Health

If software fixes do not resolve the issue, verify hardware integrity:

# Reset the NVIDIA driver without rebooting
sudo nvidia-smi --gpu-reset

# Run NVIDIA diagnostics
nvidia-smi -q
nvidia-debugdump --output-file=debug_dump.gpu

# Check for ECC errors (on Tesla/Quadro cards)
nvidia-smi -q -d ECC_ERRORS

# Monitor thermal and power status during workload
nvidia-smi dmon -s u -c 60

4. Verification

After implementing fixes, verify that the clw-gpu-crash error no longer occurs:

Test 1: Run Workload with Error Recovery

import pyopenclaw
import time

def run_with_verification():
    workload = pyopenclaw.Workload("matrix_multiply_v2")
    
    # Configure for crash recovery
    workload.config({
        "enable_crash_recovery": True,
        "max_retry_attempts": 3,
        "gpu_timeout_ms": 30000
    })
    
    start_time = time.time()
    result = workload.execute()
    elapsed = time.time() - start_time
    
    if result.success:
        print(f"Workload completed successfully in {elapsed:.2f}s")
        print(f"Output shape: {result.output.shape}")
        print(f"Peak GPU memory: {result.peak_memory_mb:.2f} MB")
        return True
    else:
        print(f"Workload failed: {result.error_code}")
        return False

Test 2: Validate with Memory Sanitizer

Enable GPU memory sanitizer to catch out-of-bounds access before they cause crashes:

# Run with GPU address sanitizer
export OPENCLAW_ENABLE_GPU_ASAN=1
openclaw run --workload matrix_multiply_v2.clw

# Monitor output for sanitizer violations
# GPU-ASAN will report violations like:
# ==12345==GPU-ERROR: global memory access violation at 0x7f1234567890
# ==12345==  Type: READ
# ==12345==  Kernel: matrix_multiply_v2.clw:147
# ==12345==  Thread: (23, 5, 128)
# ==12345==  Bounds: [0x0, 0x3B9ACA00)

Test 3: Stress Test Kernel Execution

Run the kernel through extensive testing to ensure stability:

// Test kernel with boundary conditions
__kernel void test_bounds_kernel(
    __global int* output,
    int buffer_size,
    int test_pattern) {
    
    int gid = get_global_id(0);
    
    // Test all boundary conditions
    if (gid == 0 || gid == buffer_size - 1) {
        output[gid] = test_pattern;  // Edge cases
    }
    
    if (gid == buffer_size / 2) {
        output[gid] = test_pattern;  // Middle
    }
    
    // Valid interior accesses
    if (gid > 0 && gid < buffer_size - 1) {
        output[gid] = test_pattern;
    }
}

# Run comprehensive kernel tests
openclaw test --suite kernel_memory --verbose

# Output on success:
# Test Suite: kernel_memory
#   test_bounds_null_ptr               PASSED
#   test_bounds_zero_size              PASSED
#   test_bounds_oversized              PASSED
#   test_alignment                     PASSED
#   test_concurrent_access             PASSED
# =============================
# Results: 5/5 tests PASSED

Test 4: Monitor GPU Stability Over Extended Period

# Run extended stability test (24 hours)
openclaw run --workload continuous_inference.clw \
    --duration 24h \
    --monitor-interval 60 \
    --output /var/log/openclaw/stability_test.json

# Check that no clw-gpu-crash events occurred
grep -c "clw-gpu-crash" /var/log/openclaw/stability_test.json
# Expected output: 0

5. Common Pitfalls

When addressing clw-gpu-crash errors, developers frequently encounter these pitfalls that can delay resolution:

Pitfall 1: Ignoring Driver Version Compatibility

Problem: Running OpenClaw workloads with mismatched GPU driver versions.

# Check current driver version
cat /proc/driver/nvidia/version
# NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.154.05

# Check OpenClaw requirements
openclaw --version
# OpenClaw version: 2.4.1 requires CUDA Runtime 12.1+, Driver 535.120+

# Symptoms of mismatch:
# - Kernels compile but crash immediately on launch
# - Error: CL_BUILD_PROGRAM_FAILURE despite successful compilation

Solution: Always verify driver compatibility before reporting GPU crash issues.

# Install compatible driver
sudo apt install nvidia-driver-535-server
sudo reboot

# Verify after reboot
nvidia-smi | grep -E "Driver|CUDA"

Pitfall 2: Assuming Crash Location Equals Bug Location

Problem: The crash may occur far from the actual bug in kernels with complex control flow or memory access patterns.

The GPU’s memory access violation at address 0xFFFF000012345678 might be caused by an out-of-bounds index computed thousands of instructions earlier. The instruction pointer at crash time does not necessarily indicate the source of the problem.

Solution: Use GPU memory sanitizer (ASAN) and carefully review index calculations throughout the kernel, not just at the crash point.

Pitfall 3: Insufficient Global Work Size Calculation

Problem: Launching a kernel with global work size that does not properly align to workgroup dimensions.

# Common mistake - global size not divisible by local size
global_size = (1921, 1081)  # Not divisible by typical local sizes
local_size = (16, 16)

# This causes threads at the edge to access invalid memory
kernel.launch(global_size, local_size)

Solution: Always align global work sizes:

import math

def aligned_global_size(desired, local_size):
    return tuple(
        math.ceil(d / l) * l
        for d, l in zip(desired, local_size)
    )

desired = (1921, 1081)
local_size = (16, 16)
global_size = aligned_global_size(desired, local_size)
# Result: (1920, 1088) - properly aligned

kernel.launch(global_size, local_size)

Pitfall 4: Memory Aliasing Between Buffers

Problem: Multiple OpenClaw workloads sharing overlapping physical GPU memory regions without synchronization.

# Workload A allocates 1GB starting at physical address 0x100000000
workload_a = pyopenclaw.Buffer(size_gb=1, flags=pyopenclaw.MEM_ALLOC_HOST)

# Workload B allocates 512MB - OS may give overlapping physical memory
workload_b = pyopenclaw.Buffer(size_gb=1, flags=pyopenclaw.MEM_ALLOC_HOST)

# Race condition: both workloads write to same physical memory

Solution: Use proper memory allocation flags and synchronization primitives:

# Use exclusive memory allocation
buffer = pyopenclaw.Buffer(
    size=buffer_size,
    flags=pyopenclaw.MEM_EXCLUSIVE,
    affinity=pyopenclaw.GPU_0
)

# Synchronize access across workloads
with pyopenclaw.Lock(buffer):
    kernel.launch(global_size, local_size)

Pitfall 5: Not Resetting Device After Crash

Problem: After a clw-gpu-crash, subsequent workloads fail because the GPU remains in an error state.

# Crash occurs...
workload.execute()  # May crash

# Program exits without device reset

# Next run also fails because GPU never recovered
workload2 = pyopenclaw.Workload("other_kernel")  # Still fails

Solution: Always reset the GPU device after a crash:

import pyopenclaw

def execute_with_recovery(workload):
    try:
        return workload.execute()
    except pyopenclaw.GPUCrashError as e:
        print(f"Crash detected: {e}")
        print("Resetting GPU device...")
        
        # Reset the device
        device = pyopenclaw.get_device(workload.device_id)
        device.reset()
        
        # Wait for recovery
        device.wait_for_recovery(timeout_ms=5000)
        
        raise pyopenclaw.RecoverableError("GPU reset - retry workload")

Pitfall 6: Misconfiguring Memory Limits

Problem: Setting memory limits too close to physical GPU memory capacity causes allocation failures that manifest as crashes.

# openclaw.yaml - problematic configuration
memory:
  max_allocation_mb: 15000  # Tesla T4 has 15360 MB - almost all memory
  reserve_mb: 128            # Not enough reserve for OS/TVM overhead

Solution: Leave adequate headroom:

# openclaw.yaml - better configuration
memory:
  max_allocation_mb: 14336  # Leave ~1GB for system overhead
  reserve_mb: 512            # Reserve for temporary allocations
  paging_enabled: true       # Allow paging if needed

The following error codes are commonly related to clw-gpu-crash and may occur before, after, or instead of this error:

Error Code	Relationship	Description
`clw-out-of-memory`	Often precedes crash	GPU memory allocation failure may lead to null pointer access
`clw-kernel-timeout`	Similar symptoms	Watchdog timeout may trigger crash recovery mechanisms
`clw-device-lost`	Follows crash	Device enters unrecoverable state after crash
`clw-invalid-handle`	Related root cause	Invalid buffer/kernel handles can cause memory access violations
`clw-ecc-error`	Hardware-related	ECC memory errors on Tesla/Quadro GPUs can cause similar crashes
`clw-pcie-error`	Hardware-related	PCIe bus errors can manifest as memory access violations
`clw-thermal-throttle`	Environmental	Thermal throttling may cause unstable kernel execution
`clw-driver-crash`	Same category	NVIDIA/AMD driver crashes have overlapping symptoms

Error Transition Examples

Sequence 1: Out-of-memory → GPU Crash

[INFO] Initializing workload
[ERROR] clw-out-of-memory: Cannot allocate 8192 MB (only 4096 MB available)
[WARN] Falling back to smaller buffer
[ERROR] clw-gpu-crash: Buffer pointer NULL passed to kernel

Sequence 2: GPU Crash → Device Lost

[ERROR] clw-gpu-crash: Memory access violation at 0x12345678
[ERROR]   Context marked for isolation
[ERROR] clw-device-lost: Device NVIDIA Tesla T4 (ID:0) has entered unrecoverable state
[WARN]   GPU recovery required - system reboot recommended

import pyopenclaw

def comprehensive_error_handler(func):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except pyopenclaw.OpenClawError as e:
            error_code = e.code
            
            if error_code == "clw-out-of-memory":
                print("Memory issue detected - optimizing allocation")
                # Optimize and retry
                
            elif error_code == "clw-kernel-timeout":
                print("Timeout detected - checking kernel efficiency")
                # Profile kernel and optimize
                
            elif error_code == "clw-gpu-crash":
                print("GPU crash detected - gathering diagnostics")
                # Dump state and reset
                
            elif error_code == "clw-device-lost":
                print("Device lost - requiring manual intervention")
                # Notify operator
                
            raise  # Re-raise after handling
            
    return wrapper

1. Symptoms

Observable Symptoms

Secondary Symptoms

2. Root Cause

Primary Root Causes

Underlying Technical Mechanism

3. Step-by-Step Fix

Step 1: Enable Detailed GPU Logging

Step 2: Parse the Crash Dump

Step 3: Add Bounds Checking to Kernel Code

Step 4: Implement Proper Pointer Validation in Host Code

Step 5: Fix Shared Memory Allocation Issues

Step 6: Add Synchronization Barriers

Step 7: Verify GPU Health

4. Verification

Test 1: Run Workload with Error Recovery

Test 2: Validate with Memory Sanitizer

Test 3: Stress Test Kernel Execution

Test 4: Monitor GPU Stability Over Extended Period

5. Common Pitfalls

Pitfall 1: Ignoring Driver Version Compatibility

Pitfall 2: Assuming Crash Location Equals Bug Location

Pitfall 3: Insufficient Global Work Size Calculation

Pitfall 4: Memory Aliasing Between Buffers

Pitfall 5: Not Resetting Device After Crash

Pitfall 6: Misconfiguring Memory Limits

6. Related Errors

Error Transition Examples

Debugging Related Errors