Fix clw-gpu-unreachable: GPU Device Cannot Be Reached by OpenClaw Runtime

1. Symptoms

The clw-gpu-unreachable error manifests when the OpenClaw runtime establishes an initial connection to the host system but cannot communicate with or access a configured GPU device. This error typically occurs during workload initialization or when attempting to dispatch compute kernels to GPU hardware.

Typical error message:

[OpenClaw Runtime Error] clw-gpu-unreachable
Failed to establish communication channel with GPU device: NVIDIA Tesla V100 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
Target device is either offline, inaccessible, or has been removed from the compute node.
Error Code: clw-gpu-unreachable
Timestamp: 2024-01-15T10:23:45.123Z
Runtime Version: openclaw-2.4.1
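
When the error repeats in a log, it can help to collect the UUIDs of the affected devices first. The following is a minimal sketch, assuming the log contains the message format shown above; the function name and the log path in the usage comment are hypothetical.

```shell
# Print each distinct GPU UUID named in clw-gpu-unreachable messages.
# Reads log text on stdin and assumes the "(UUID: GPU-...)" format shown above.
# Hypothetical usage: unreachable_uuids < /var/log/openclaw/runtime.log
unreachable_uuids() {
    sed -n 's/.*Failed to establish communication channel with GPU device:.*(UUID: \(GPU-[^)]*\)).*/\1/p' |
        sort -u
}
```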

Additional observable symptoms:

Compute workloads queue indefinitely without execution
GPU utilization metrics show zero activity despite workload submission
The clw devices list command shows the GPU in an “unreachable” or “offline” state
Driver-level diagnostics may report “GPU has fallen off the bus”
Periodic timeout errors in the runtime logs preceding the unreachability status
Memory allocation attempts fail with cascading error codes
Kernel compilation succeeds but dispatch fails immediately

Affected operations:

clw run - Workload execution
clw submit - Job submission to GPU queue
clw mem-alloc - Device memory allocation
clw kernel-dispatch - Kernel execution requests
clw profile - Performance profiling sessions

2. Root Cause

The clw-gpu-unreachable error occurs when the OpenClaw runtime cannot maintain or establish a functional communication channel with a GPU device. This breakdown in communication can stem from several underlying causes:

Hardware-Level Issues:

The GPU device may have physically disconnected from the PCIe bus, lost power, or entered a thermal shutdown. An overheating GPU first throttles its clocks and, if the temperature keeps rising, triggers a protective shutdown that the runtime sees as unreachability. The slowdown and shutdown thresholds vary by model; nvidia-smi -q -d TEMPERATURE reports them for the installed card. PCIe link degradation can also cause intermittent connectivity that eventually becomes a complete communication failure.
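
To spot a thermal problem before it turns into a shutdown, GPU temperatures can be compared against a limit. This is a sketch only: the function name is hypothetical, and the default limit of 85 is a placeholder rather than a vendor threshold.

```shell
# Print the index of every GPU whose temperature meets or exceeds a limit.
# Reads "index, temperature" CSV lines on stdin, as produced by:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits
# The limit is in degrees C; 85 is only a placeholder default.
hot_gpus() {
    limit=${1:-85}
    # Rows whose temperature is not a plain number (for example "[N/A]") are skipped.
    awk -F', *' -v limit="$limit" \
        '$2 ~ /^[0-9]+$/ && $2 + 0 >= limit + 0 { print $1 }'
}
```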

Driver and Firmware Problems:

Outdated, corrupted, or incompatible GPU drivers frequently cause unreachability errors. A driver that is older than the minimum required by the CUDA toolkit OpenClaw uses can fail to establish proper communication. Firmware bugs in the GPU’s VBIOS can also leave the device unresponsive even though it still appears in enumeration.

Runtime State Corruption:

If the GPU device was previously used by another process that crashed or terminated abnormally, the device may be left in an inconsistent state. NVIDIA’s Persistence Daemon (nvidia-persistenced) helps mitigate this, but configuration errors can leave devices locked or in a zombie state that prevents new runtime connections.

System Resource Exhaustion:

Insufficient PCIe resources (BAR memory space), NUMA node misalignment, or IRQ conflicts can prevent the runtime from mapping device memory or establishing DMA transfers. When the runtime cannot map the device’s memory regions, it cannot communicate and reports unreachability.

Security and Virtualization Barriers:

In containerized or virtualized environments, improper GPU passthrough configuration (VFIO, NVIDIA Container Toolkit misconfiguration, or hypervisor GPU assignment failures) can isolate the GPU from the runtime’s access. IOMMU grouping issues, incorrect device permissions, or missing kernel modules can all produce this error.
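
Of the causes listed here, a missing kernel module is one of the quickest to rule out. The sketch below assumes that nvidia and nvidia_uvm are the modules a CUDA-based runtime needs; the function name is hypothetical.

```shell
# Print each expected NVIDIA kernel module that is absent from lsmod output.
# Reads `lsmod` (or /proc/modules) text on stdin.
# Usage: lsmod | missing_nvidia_modules
missing_nvidia_modules() {
    loaded=$(awk '{ print $1 }')
    for mod in nvidia nvidia_uvm; do
        # -x matches the whole line, so "nvidia" does not match "nvidia_uvm".
        echo "$loaded" | grep -qx "$mod" || echo "$mod"
    done
}
```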

3. Step-by-Step Fix

Step 1: Verify GPU Visibility at Hardware Level

Before addressing software configuration, confirm the GPU is visible to the system:

# Check PCIe enumeration
lspci | grep -iE "vga|3d|display|nvidia"

# Example output for a Tesla V100 (data center GPUs usually enumerate as "3D controller"):
# 01:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)

# Check NVIDIA GPU status
nvidia-smi

# Verify GPU state and temperature
nvidia-smi -q

# Confirm persistence daemon is running
systemctl status nvidia-persistenced

If the GPU does not appear in lspci, this indicates a hardware or PCIe configuration problem that must be resolved before proceeding.
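
A GPU can also be present on the bus yet missing from the driver's device list, which points at the driver rather than the hardware. A minimal sketch for comparing the two counts follows; the function name is hypothetical.

```shell
# Count NVIDIA GPUs in `lspci` output read from stdin.
# Only display-class functions are counted, so audio functions are ignored.
# Usage:
#   pci=$(lspci | count_nvidia_gpus)
#   drv=$(nvidia-smi -L | grep -c '^GPU ')
#   [ "$pci" -eq "$drv" ] || echo "driver sees $drv of $pci GPUs"
count_nvidia_gpus() {
    grep -i 'nvidia' |
        grep -ciE 'vga compatible controller|3d controller|display controller'
}
```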

Step 2: Restart the GPU Device

For transient GPU state issues, a driver-level reset may restore connectivity:

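# Option A: GPU reset through the driver (requires that no process is using the GPU)
sudo nvidia-smi --gpu-reset -i 0

# Option B: Reload the driver modules (stop all GPU clients first;
# omit any module that lsmod does not list)
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
sudo modprobe nvidia_uvm

# Option C: Remove the device from the PCI bus and rescan
# (replace 0000:01:00.0 with the address reported by lspci -D)
echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.0/remove
echo 1 | sudo tee /sys/bus/pci/rescan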

Before:

[OpenClaw Runtime]
GPU Status: UNREACHABLE
Driver Version: 535.154.05
CUDA Version: 12.2
Last Communication: 2024-01-15T09:45:12Z
Error: clw-gpu-unreachable

After:

[OpenClaw Runtime]
GPU Status: READY
Driver Version: 535.154.05
CUDA Version: 12.2
Last Communication: 2024-01-15T10:25:00Z
Connected: true

Step 3: Verify CUDA and Driver Compatibility

Ensure the CUDA toolkit version aligns with the installed driver:

# Check installed driver version
nvidia-smi | grep "Driver Version"

# Check CUDA toolkit version
nvcc --version

# Verify OpenClaw's expected CUDA runtime
grep cuda_version /etc/openclaw/openclaw.conf

# Install matching CUDA if version mismatch detected
# (Debian/Ubuntu with NVIDIA's apt repository configured)
sudo apt install cuda-runtime-12-2 cuda-drivers-535
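
The version comparison can be scripted. The sketch below is a conservative check that treats a toolkit as supported only when its version does not exceed the maximum the driver reports; the function name is hypothetical, and CUDA's minor-version compatibility may allow some combinations that this check rejects.

```shell
# Succeed if the toolkit version does not exceed the driver's maximum.
# Both arguments are MAJOR.MINOR strings such as 12.2.
# Usage:
#   drv=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
#   tk=$(nvcc --version | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
#   toolkit_supported "$drv" "$tk" || echo "toolkit $tk is newer than driver maximum $drv"
toolkit_supported() {
    drv_major=${1%%.*}; drv_minor=${1#*.}
    tk_major=${2%%.*};  tk_minor=${2#*.}
    [ "$tk_major" -lt "$drv_major" ] ||
        { [ "$tk_major" -eq "$drv_major" ] && [ "$tk_minor" -le "$drv_minor" ]; }
}
```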

Step 4: Configure Proper Device Permissions

GPU device access requires correct permissions for the runtime process:

# Add current user to the video and render groups
# (log out and back in for the new membership to take effect)
sudo usermod -aG video $USER
sudo usermod -aG render $USER

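# Set correct device permissions (the glob also covers /dev/nvidia-uvm*)
sudo chmod 666 /dev/nvidia*

# Create udev rules for persistent permissions
# KERNEL must match the device node names, not the kernel module names.
# If the nodes are created by nvidia-modprobe rather than udev, these rules
# may not apply; the driver's NVreg_DeviceFileMode module option is the alternative.
sudo tee /etc/udev/rules.d/70-nvidia.rules << 'EOF'
KERNEL=="nvidia[0-9]*", MODE="0666", GROUP="video"
KERNEL=="nvidiactl", MODE="0666", GROUP="video"
KERNEL=="nvidia-modeset", MODE="0666", GROUP="video"
KERNEL=="nvidia-uvm*", MODE="0666", GROUP="video"
EOF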

sudo udevadm control --reload-rules
sudo udevadm trigger

Step 5: Configure OpenClaw for Correct Device Targeting

Update the OpenClaw runtime configuration to target the correct GPU:

# List available devices with OpenClaw
clw devices list --verbose

# Create/update configuration with correct device UUID
clw config set default.gpu.uuid GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

# Alternative: Target by device index
clw config set default.gpu.index 0

Before:

# /etc/openclaw/workload.yaml
runtime:
  gpu:
    device_selection: auto
    timeout_ms: 30000

After:

# /etc/openclaw/workload.yaml
runtime:
  gpu:
    device_selection: explicit
    device_uuid: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    timeout_ms: 60000
    retry_attempts: 3
    retry_delay_ms: 5000

Step 6: Restart OpenClaw Runtime Services

After configuration changes, restart the runtime:

# Stop runtime services
sudo systemctl stop openclaw-runtime
sudo systemctl stop openclaw-scheduler

# Clear runtime state
sudo rm -rf /var/lib/openclaw/runtime_state/*
sudo rm -rf /tmp/openclaw-*

# Start runtime services
sudo systemctl start openclaw-runtime
sudo systemctl start openclaw-scheduler

# Verify service status
sudo systemctl status openclaw-runtime
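
Services may take a few seconds to become active after a restart, so scripted checks are better off polling than testing once. A minimal sketch of such a polling helper follows; the function name is hypothetical and the one-second interval is arbitrary.

```shell
# Run a command once per second until it succeeds or the timeout elapses.
# Usage: wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Example: wait_for 60 systemctl is-active --quiet openclaw-runtime
wait_for() {
    remaining=$1
    shift
    until "$@"; do
        # Give up once the time budget is spent.
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
}
```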

4. Verification

After applying the fix, verify GPU reachability through multiple validation steps:

Runtime-Level Verification:

# Check device status via OpenClaw
clw devices list

# Expected output:
# +----+-----------------------------+-----------+--------+
# | ID | UUID                        | Status    | Memory |
# +----+-----------------------------+-----------+--------+
# | 0  | GPU-xxxxxxxx-xxxx-xxxx-xxxx | REACHABLE | 32GB   |
# +----+-----------------------------+-----------+--------+

# Test workload execution
clw run --image=openclaw/test:latest --command="nvidia-smi"

# Verify kernel execution capability
clw kernel-verify --device=0
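
For automated checks, the device table can be filtered for rows that are not yet reachable. The sketch below assumes the four-column table layout shown in the expected output; the function name is hypothetical.

```shell
# Print "ID STATUS" for every device row whose status is not REACHABLE.
# Reads the table on stdin. Border and header rows are ignored.
# Usage: clw devices list | not_reachable
not_reachable() {
    awk -F'|' '
        NF >= 5 {
            id = $2; status = $4
            gsub(/[ \t]/, "", id)
            gsub(/[ \t]/, "", status)
            if (id ~ /^[0-9]+$/ && status != "REACHABLE") print id, status
        }'
}
```

Empty output means every listed device reported REACHABLE.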

Hardware-Level Verification:

# Confirm GPU is responsive
nvidia-smi

# Check GPU utilization (should show activity if workload is running)
watch -n 1 nvidia-smi

# Verify ECC status (for data center GPUs)
nvidia-smi -q | grep -i ecc

Application-Level Verification:

Deploy a test compute workload to confirm end-to-end functionality:

# Create a simple test workload
cat > /tmp/gpu_test.yaml << 'EOF'
name: gpu-reachability-test
runtime: openclaw
gpu:
  required: true
  memory_mb: 1024
kernel:
  source: |
    __global__ void test_kernel(float* data) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] = idx * 1.0f;
    }
  entry: test_kernel
input:
  size_mb: 64
EOF

# Submit test workload
clw submit /tmp/gpu_test.yaml --wait

# Verify execution completed successfully
clw jobs list --status=completed --limit=5

5. Common Pitfalls

Ignoring the Persistence Daemon:

Failing to enable nvidia-persistenced causes GPU state loss when no processes are using the device. This frequently results in unreachability errors on subsequent workload submissions, especially in multi-tenant or batch-processing environments.

# Enable persistence daemon (prevents device state loss)
sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced

Mixing CUDA Toolkit Versions:

Using a CUDA toolkit version newer than the installed driver supports causes API incompatibilities. OpenClaw may successfully enumerate the device but fail during kernel compilation or dispatch due to the version mismatch.

# Always match CUDA toolkit to driver capabilities
# Driver 525.x supports up to CUDA 12.0
# Driver 535.x supports up to CUDA 12.2
# Newer drivers also run applications built with older toolkits
# The "CUDA Version" field in the nvidia-smi header shows the maximum for the installed driver

Insufficient BAR Memory Allocation:

GPUs require mapped BAR (Base Address Register) memory regions for host-device communication. Insufficient allocation in BIOS settings causes the driver to fail device initialization.

# Check BAR memory allocation in dmesg
dmesg | grep -i nvidia | grep -i bar

# Adjust BIOS settings: "Above 4G Decoding" should be enabled
# Resize BAR option if available (PCIe Resizable BAR support)

Container Runtime GPU Passthrough Misconfiguration:

When running OpenClaw inside containers, incorrect NVIDIA Container Toolkit configuration isolates the GPU from the runtime. Missing --gpus flags or misconfigured docker-compose.yml files prevent device access.

# Correct docker-compose.yml for GPU access
services:
  openclaw:
    image: openclaw/runtime:2.4.1
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    volumes:
      - /var/lib/openclaw:/var/lib/openclaw

IOMMU Grouping Conflicts in Virtualized Environments:

Passthrough GPUs must be in isolated IOMMU groups to prevent conflicts. Sharing an IOMMU group with other devices causes access failures and unreachability.

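# Check IOMMU grouping: print each NVIDIA function with its group number
for dev in /sys/kernel/iommu_groups/*/devices/*; do
    group=${dev%/devices/*}; group=${group##*/}
    echo "group $group: $(lspci -nns "${dev##*/}")"
done | grep -i nvidia

# The GPU and its own functions should be the only endpoints in their group
# If grouped with other devices, configure VFIO to isolate properly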

Race Conditions During Device Enumeration:

Rapid device state changes during boot or hot-plug events cause the runtime to query the device before it reaches a stable state. Implementing retry logic with exponential backoff in configuration mitigates this.

# Add retry configuration to openclaw.conf
runtime:
  gpu:
    connection_retry:
      max_attempts: 5
      initial_delay_ms: 1000
      backoff_multiplier: 2
      max_delay_ms: 30000
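
To see what these values imply, the waits can be computed by hand. The sketch below assumes the first retry waits initial_delay_ms, each later wait is multiplied by backoff_multiplier, and every wait is capped at max_delay_ms. How OpenClaw actually interprets these keys is an assumption here, and the function name is hypothetical.

```shell
# Print the wait in milliseconds before each retry, one per line.
# Arguments: max_attempts initial_delay_ms backoff_multiplier max_delay_ms
backoff_schedule() {
    attempts=$1; delay=$2; multiplier=$3; cap=$4
    i=1
    # N attempts leave N - 1 waits between them.
    while [ "$i" -lt "$attempts" ]; do
        if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
        echo "$delay"
        delay=$((delay * multiplier))
        i=$((i + 1))
    done
}
```

With the values above, backoff_schedule 5 1000 2 30000 prints 1000, 2000, 4000, and 8000, so the five attempts span 15 seconds of waiting and the 30000 ms cap is never reached.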

6. Related Errors

clw-gpu-init-failed

This error occurs earlier in the initialization sequence than clw-gpu-unreachable. It indicates the runtime failed to load the GPU driver interface or allocate the resources needed for device communication. Unlike unreachability, where the runtime loses or cannot open a channel to a device it has already discovered, init failures typically indicate configuration or driver installation problems.

[OpenClaw Runtime Error] clw-gpu-init-failed
Failed to load NVIDIA driver interface: libcuda.so.1 not found

clw-device-busy

When the GPU is reachable but currently locked by another process, clw-device-busy results. This can appear as a transient condition that precedes clw-gpu-unreachable if the blocking process prevents the runtime from maintaining its communication channel.

[OpenClaw Runtime Error] clw-device-busy
GPU device is exclusively locked by process: python (PID: 12345)

clw-runtime-disconnected

This broader error indicates that the OpenClaw runtime host process lost its connection to the runtime daemon. GPU unreachability can be one consequence, but the cause may also be a wider system-level communication failure.

[OpenClaw Runtime Error] clw-runtime-disconnected
Loss of connection to runtime daemon at unix:///var/run/openclaw/daemon.sock

clw-memory-alloc-failed

GPU memory allocation failures often cascade from reachability issues. When the device is unreachable, all memory operations fail, producing this secondary error. Resolving the unreachability typically resolves memory allocation issues.

[OpenClaw Runtime Error] clw-memory-alloc-failed
Failed to allocate 8192MB on device GPU-xxxxxxxx: device unreachable

clw-kernel-execution-failed

Kernel dispatch failures may occur when the GPU transitions to unreachable state during execution. This error typically includes context information about which kernel was executing when the device became unreachable.

[OpenClaw Runtime Error] clw-kernel-execution-failed
Kernel "matrix_multiply" failed with error: GPU device unreachable
CUDA error: cudaErrorDevicesUnavailable