Fix clw-llm-timeout: Resolve OpenClaw LLM request timeouts efficiently

OpenClaw · Intermediate · Linux · macOS · Windows

1. Symptoms

The clw-llm-timeout error in OpenClaw manifests during LLM inference requests, typically when communicating with remote API endpoints like those from OpenAI, Anthropic, or custom servers. This error halts execution and logs a specific message indicating a timeout.

Common symptoms include:

[ERROR] clw-llm-timeout: Request to LLM endpoint 'https://api.llm-provider.com/v1/completions' timed out after 30s. No response received. Context: prompt_length=2048, model=claude-3-opus.


- Application freezes or throws exceptions mid-request.
- High CPU usage during wait periods, followed by abrupt failure.
- Intermittent failures under load, especially with long prompts (>1000 tokens).
- Logs show repeated attempts if retries are enabled, e.g.:

[INFO] clw-llm-timeout: Retry attempt 1/3 after 30s timeout.
[ERROR] clw-llm-timeout: All retries exhausted.


This occurs in synchronous and asynchronous modes. In production, it leads to cascading failures in pipelines relying on LLM responses for tasks like code generation or data processing. Monitor via OpenClaw's debug logs (`CLW_LOG_LEVEL=DEBUG`) to confirm:

DEBUG: HTTP request sent at 2024-10-12T10:30:00Z
DEBUG: No ACK after 30s, marking as timeout.


Affected versions: OpenClaw v2.1.0+, common in C++ applications integrating LLMs.

2. Root Cause

The `clw-llm-timeout` stems from OpenClaw's default HTTP client timeout (30 seconds) being insufficient for LLM requests. Root causes:

1. **Slow LLM Inference**: Large models (e.g., GPT-4, Claude-3) take 45-120s for complex prompts due to queueing, token generation, and server load.
2. **Network Latency**: High RTT (>200ms) in cloud environments, VPNs, or regions far from the LLM provider's data centers.
3. **Payload Size**: Prompts exceeding 4k tokens inflate transmission time, and verbose responses add further generation and transfer time.
4. **Server-Side Issues**: Provider overload, rate limits, or maintenance without prior notice.
5. **Client Misconfiguration**: Default `ClwLLMClient` uses `curl` backend with no custom timeouts; async mode (`clw::async_request`) exacerbates if event loop blocks.

Internally, OpenClaw's `libclw_http.cpp` enforces:

```cpp
// Simplified from OpenClaw source (v2.1.2)
constexpr int DEFAULT_TIMEOUT_SEC = 30;
if (response_timeout > 0) {
    curl_easy_setopt(handle, CURLOPT_TIMEOUT, response_timeout);
} else {
    curl_easy_setopt(handle, CURLOPT_TIMEOUT, DEFAULT_TIMEOUT_SEC);
}
```

No response by the deadline triggers `CLW_ERR_LLM_TIMEOUT`. For diagnostics, baseline the endpoint's latency independently with `curl -v`.

3. Step-by-Step Fix

Fix clw-llm-timeout by configuring timeouts, enabling retries, and optimizing requests. Target 60-300s based on model/prompt size.
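As a rough starting point, the 60-300s target can be derived from prompt size. The helper below is illustrative only, not part of OpenClaw's API; the per-token budget is an assumption you should tune against your provider's observed latency:

```cpp
#include <algorithm>

// Hypothetical heuristic (not part of OpenClaw's API): scale the timeout
// with prompt size, clamped to the 60-300s window suggested above.
// Budget roughly 1 extra second per 50 prompt tokens on top of a 60s floor.
int suggested_timeout_sec(int prompt_tokens) {
    return std::min(60 + prompt_tokens / 50, 300);
}
```

For example, a 2048-token prompt yields a 100s budget to pass to a timeout setter such as `set_timeout`.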

Step 1: Update Client Initialization

Increase timeout in ClwLLMClient constructor or setter.

Before:

```cpp
#include <openclaw/llm_client.h>
#include <iostream>

int main() {
    clw::LLMClient client("your-api-key", "https://api.llm-provider.com/v1");
    // Default 30s timeout leads to clw-llm-timeout

    auto response = client.complete("Explain quantum computing in detail.");
    // Fails with timeout
    std::cout << response.text << std::endl;
    return 0;
}
```

After:

```cpp
#include <openclaw/llm_client.h>
#include <iostream>

int main() {
    clw::LLMClient client("your-api-key", "https://api.llm-provider.com/v1");
    client.set_timeout(120);     // 2 minutes for large models
    client.set_retries(3, 5.0);  // 3 retries, 5s initial backoff

    auto response = client.complete("Explain quantum computing in detail.");
    if (response.error == clw::Error::None) {
        std::cout << response.text << std::endl;
    } else {
        std::cerr << "Error: " << clw::error_message(response.error) << std::endl;
    }
    return 0;
}
```
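Retry-with-backoff amounts to the loop below. This is a generic sketch of the pattern, not OpenClaw's actual source; `set_retries(3, 5.0)` presumably configures something similar internally:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Generic retry-with-backoff sketch (illustrative, not OpenClaw code).
// Calls `op` up to `max_attempts` times, doubling the delay between tries.
bool retry_with_backoff(const std::function<bool()>& op,
                        int max_attempts, double base_delay_sec) {
    double delay = base_delay_sec;
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (op()) return true;              // success: stop retrying
        if (attempt == max_attempts) break; // no sleep after the last try
        std::this_thread::sleep_for(std::chrono::duration<double>(delay));
        delay *= 2.0;                       // exponential backoff
    }
    return false;
}
```

The doubling spreads repeated attempts out over time, which avoids hammering an already-overloaded endpoint.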

Step 2: Async Handling for Production

Switch to async to prevent blocking.

Before:

```cpp
// Sync blocks the main thread for the full request duration
auto sync_resp = client.complete(long_prompt);
```

After:

```cpp
#include <openclaw/llm_client.h>
#include <future>

std::future<clw::LLMResponse> fut = client.async_complete(long_prompt);
auto resp = fut.get();  // Waits with timeout propagation
```
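A hard client-side deadline can also be layered on top of any future, independent of the HTTP timeout, via `std::future::wait_for`. A standalone sketch (the fallback behavior is one design choice; surfacing an error is another):

```cpp
#include <chrono>
#include <future>
#include <string>

// Sketch: enforce a client-side deadline on any asynchronous result.
// Returns `fallback` if the future is not ready within `deadline`.
std::string get_with_deadline(std::future<std::string>& fut,
                              std::chrono::seconds deadline,
                              const std::string& fallback) {
    if (fut.wait_for(deadline) == std::future_status::ready) {
        return fut.get();
    }
    return fallback;  // timed out: caller can retry or surface an error
}
```

This keeps the caller responsive even if the underlying transport's timeout is misconfigured.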

Step 3: Environment Overrides

Set via env vars for quick fixes:

```bash
export CLW_LLM_TIMEOUT_SEC=180
export CLW_LLM_RETRIES=5
export CLW_HTTP_CONNECT_TIMEOUT=10  # Separate connect timeout
./your_app
```
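Resolving such overrides in code typically follows a getenv-with-fallback pattern. A minimal sketch of that pattern; the parsing here is illustrative, and OpenClaw's own precedence rules may differ:

```cpp
#include <cstdlib>
#include <string>

// Sketch: integer env-var override with a fallback default.
// Returns the numeric value of `name` if set and parseable, else `fallback`.
int env_int_or(const char* name, int fallback) {
    const char* raw = std::getenv(name);
    if (raw == nullptr) return fallback;
    try {
        return std::stoi(raw);
    } catch (...) {
        return fallback;  // non-numeric value: keep the default
    }
}
```

A client would then call something like `env_int_or("CLW_LLM_TIMEOUT_SEC", 30)` at startup so the env var wins over the compiled-in default.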

Step 4: Network Optimization

Tune curl options:

```cpp
client.set_curl_option(CURLOPT_LOW_SPEED_LIMIT, 1024);  // Abort if under 1 KB/s...
client.set_curl_option(CURLOPT_LOW_SPEED_TIME, 60);     // ...for 60 consecutive seconds
```

Step 5: Prompt Optimization

Reduce token count:

Before:

```cpp
std::string verbose_prompt = "Write a 5000-word essay on...";
```

After:

```cpp
std::string optimized_prompt = "Summarize key points on quantum computing (max 500 words):";
```
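A crude length check before sending can catch oversized prompts early. The ~4 characters-per-token ratio below is a rule of thumb for English text, an assumption rather than any provider's actual tokenizer:

```cpp
#include <string>

// Rough token estimate: ~4 characters per token is a common rule of
// thumb for English text; real tokenizers differ.
std::size_t estimate_tokens(const std::string& prompt) {
    return prompt.size() / 4;
}

// True when the prompt likely fits under the given token budget.
bool fits_budget(const std::string& prompt, std::size_t max_tokens) {
    return estimate_tokens(prompt) <= max_tokens;
}
```

Rejecting or truncating prompts that fail `fits_budget` avoids paying a full timeout just to learn the request was too large.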

Reconfigure with `cmake -DCLW_ENABLE_RETRIES=ON .`, then rebuild with `cmake --build .` (the `-D` flag belongs to the configure step, not the build step).

4. Verification

Post-fix, verify with a load test script:

```bash
#!/bin/bash
for i in {1..10}; do
    echo "Test $i:"
    # PASS only when the error string is absent from the app's output
    if ! ./your_app --test-long-prompt 2>&1 | grep -q "clw-llm-timeout"; then
        echo "PASS"
    else
        echo "FAIL"
    fi
done
```

Expected output:

```
Test 1: PASS
[INFO] LLM response received in 45s.
```

Monitor metrics:

  • Logs: no `clw-llm-timeout` entries.
  • Prometheus endpoint (`/metrics`, if enabled): `clw_llm_requests_total{status="timeout"}` == 0.
  • `curl -w "%{time_total}\n" -d '{"prompt":"test"}' https://api.llm-provider.com/v1/completions` should report a total time under 120s.

Unit test example:

```cpp
#include <gtest/gtest.h>
#include <openclaw/llm_client.h>

TEST(LLMClientTest, TimeoutFix) {
    clw::LLMClient client(/* mock */);
    client.set_timeout(120);
    EXPECT_EQ(client.complete("short").error, clw::Error::None);
}
```

Run `ctest -V` to confirm.

5. Common Pitfalls

  • Unit Mismatch: `set_timeout(30)` is seconds, not milliseconds; use `set_timeout_ms(30000)` for millisecond precision.
  • Retries Without Backoff: retrying persistent failures in a tight loop wastes quota; always pair retries with exponential backoff.
  • Async Deadlocks: mixing sync and async calls without proper futures leads to hidden timeouts.
  • Env Var Precedence: CLI flags override env vars; debug with `CLW_LOG_LEVEL=TRACE`.
  • Large Responses: set `max_tokens=4096` to cap generation time.
  • Proxy Interference: corporate proxies add latency; configure `HTTP_PROXY` and test a bypass.
  • Version Mismatch: pre-v2.1.0 lacks `set_retries`; upgrade via `git submodule update`.
  • ⚠️ Unverified: custom curl backends (e.g., libcurl 8+) may ignore `CURLOPT_TIMEOUT`.
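The backoff pitfall above is easiest to avoid by capping the exponential growth so persistent failures never produce unbounded waits. A pattern sketch, not OpenClaw code:

```cpp
#include <algorithm>

// Delay (in seconds) before retry `attempt` (1-based): exponential
// growth from `base`, capped at `max_delay` to bound worst-case waits.
double backoff_delay(int attempt, double base, double max_delay) {
    double d = base;
    for (int i = 1; i < attempt; ++i) d *= 2.0;
    return std::min(d, max_delay);
}
```

With `base = 5.0` and `max_delay = 60.0`, the schedule is 5s, 10s, 20s, 40s, then 60s for every later attempt.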

Overlooking provider-specific limits (e.g., Anthropic: 60s max) also causes false positives. Related errors worth ruling out:

  • clw-network-fail: Socket/connect errors, fix with DNS resolution checks.
  • clw-llm-auth: 401/403 responses; validate API keys rotation.
  • clw-request-limit: 429 rate limits; implement client-side throttling.
  • clw-parse-error: Malformed JSON responses post-timeout partials.

Cross-reference: Fix clw-network-fail, OpenClaw Docs: Timeouts.

For OpenClaw v2.2+, consider streaming mode (`client.stream_complete()`) to mitigate timeouts entirely.