## 1. Symptoms
The `clw-llm-failure` error in OpenClaw manifests during LLM inference calls. Common indicators:
- Console output: `ERROR: clw-llm-failure: Failed to invoke LLM endpoint: [detailed message, e.g., 'HTTP 503 Service Unavailable']`
- Application halts on `ClawLLM::infer()` or `clw_llm_generate()`.
- HTTP status codes in logs: 429 (rate limit), 500 (internal server error), or connection refused.
- No response payload; partial traces show successful init but failure on the `/v1/completions` POST.
Example log snippet:

```
[2024-10-01T12:00:00Z] INFO: ClawLLM init with model=gpt-4o-mini, endpoint=https://api.openai.com/v1
[2024-10-01T12:00:01Z] ERROR: clw-llm-failure: Request to LLM failed. Code: -32603, Message: Internal JSON-RPC error.
[2024-10-01T12:00:01Z] FATAL: Inference aborted. Retries exhausted: 3/3
```
---
Runtime symptoms include:
- Spikes in CPU usage during retries.
- Increased latency (>10s per call).
- Docker containers restarting when running OpenClaw in containerized environments.
This error blocks all downstream LLM-dependent pipelines, such as RAG systems or chatbots built on OpenClaw.
## 2. Root Cause
`clw-llm-failure` triggers when OpenClaw's ClawLLM client cannot complete an inference request. Core causes:
1. **Network/Connectivity**: Firewall blocks, DNS resolution fails, or proxy misconfig. OpenClaw uses HTTP/2 to LLM providers (OpenAI, Anthropic, etc.).
2. **Authentication**: Invalid/missing `CLAW_LLM_API_KEY` or expired tokens.
3. **Provider-Side Issues**: Model overload (e.g., GPT-4 rate limits), endpoint downtime, or unsupported model names.
4. **Configuration Mismatch**: Incorrect `base_url`, `model_id`, or payload format (e.g., missing `temperature` in JSON).
5. **Resource Exhaustion**: Local request timeouts (the 5 s default is often too low; see Step 2), memory leaks in multi-threaded calls.
6. **JSON-RPC Parsing**: Malformed responses from LLM API not handled by ClawLLM v2.1+.
Inspect via `claw_debug --llm` flag for traces. Root cause often correlates with `curl -v` tests to the endpoint.
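The authentication cause (#2) can be caught before any request is made. The sketch below is a hypothetical pre-flight check, not part of the ClawLLM API: it rejects an unset, too-short, or whitespace-containing key so the failure surfaces as a clear local message rather than a generic `clw-llm-failure` at inference time. The 30-character minimum mirrors the `wc -c` heuristic used in Step 1.

```cpp
#include <cstdlib>
#include <cstring>

// Basic shape check for a provider API key; assumes keys are printable,
// whitespace-free, and at least 30 characters long.
bool api_key_looks_valid(const char* key) {
    if (key == nullptr) return false;        // CLAW_LLM_API_KEY unset
    std::size_t len = std::strlen(key);
    if (len < 30) return false;              // too short for a provider key
    for (std::size_t i = 0; i < len; ++i) {  // reject whitespace/control chars
        if (static_cast<unsigned char>(key[i]) <= ' ') return false;
    }
    return true;
}
```

Call it with `std::getenv("CLAW_LLM_API_KEY")` before constructing the client and exit early if it returns `false`.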
## 3. Step-by-Step Fix
Resolve `clw-llm-failure` systematically. Start with basics, escalate to code changes.
### Step 1: Verify Environment
```bash
# Check API key and endpoint
echo $CLAW_LLM_API_KEY | wc -c # >30 chars expected
curl -H "Authorization: Bearer $CLAW_LLM_API_KEY" https://api.openai.com/v1/models
```
### Step 2: Update ClawLLM Config

Set robust defaults in `claw_config.toml`.

Before:

```toml
[llm]
endpoint = "https://api.openai.com/v1"
model = "gpt-4o-mini"
timeout_ms = 5000
retries = 1
```
After:

```toml
[llm]
endpoint = "https://api.openai.com/v1"
model = "gpt-4o-mini"
api_key = "${CLAW_LLM_API_KEY}"
timeout_ms = 30000
retries = 5
backoff_ms = 1000
validate_model = true
```
### Step 3: Implement Retry Logic in Code

Wrap ClawLLM calls with exponential backoff.
Before:

```cpp
#include <openclaw/claw_llm.h>
#include <iostream>

int main() {
    ClawLLM llm("gpt-4o-mini");
    auto response = llm.infer("Hello world");
    if (!response.ok()) {
        std::cerr << "LLM failed: " << response.error() << std::endl;
        return 1;
    }
    std::cout << response.text() << std::endl;
    return 0;
}
```
After:

```cpp
#include <openclaw/claw_llm.h>
#include <chrono>
#include <iostream>
#include <stdexcept>
#include <string>
#include <thread>

std::string safe_infer(ClawLLM& llm, const std::string& prompt, int max_retries = 5) {
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        auto response = llm.infer(prompt);
        if (response.ok()) {
            return response.text();
        }
        if (response.error_code() != "clw-llm-failure") {
            throw std::runtime_error("Non-retryable LLM error: " + response.error());
        }
        int backoff = 1000 * (1 << attempt); // Exponential: 1s, 2s, 4s...
        std::this_thread::sleep_for(std::chrono::milliseconds(backoff));
    }
    throw std::runtime_error("LLM retries exhausted");
}

int main() {
    ClawLLM llm("gpt-4o-mini");
    try {
        auto result = safe_infer(llm, "Hello world");
        std::cout << result << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Fatal: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
```
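One refinement worth considering: the fixed 1 s / 2 s / 4 s schedule above can synchronize retries across many clients that fail at the same moment, amplifying load spikes. Adding random jitter and a cap spreads retries out. This is a sketch of the backoff computation alone; the helper name and the 30 s cap are illustrative, not part of the ClawLLM API.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

// Full-jitter backoff: sleep a random duration in [0, min(cap, base * 2^attempt)]
// instead of the exact doubled value. The shift is clamped to avoid overflow.
int64_t backoff_ms_with_jitter(int attempt, int64_t base_ms = 1000,
                               int64_t cap_ms = 30000) {
    int64_t ceiling = std::min(cap_ms, base_ms << std::min(attempt, 20));
    static thread_local std::mt19937_64 rng{std::random_device{}()};
    std::uniform_int_distribution<int64_t> dist(0, ceiling);
    return dist(rng);
}
```

Substituting this for the `1000 * (1 << attempt)` line in `safe_infer()` keeps the same retry structure while decorrelating clients.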
### Step 4: Fallback to Local Model

For resilience, add an Ollama fallback.

Before: a single provider. After:
```cpp
ClawLLMConfig cfg;
const char* env_endpoint = std::getenv("CLAW_LLM_ENDPOINT");
cfg.endpoint = env_endpoint ? env_endpoint : "https://api.openai.com/v1";
cfg.model = "gpt-4o-mini";
cfg.fallback_endpoint = "http://localhost:11434"; // Ollama
cfg.fallback_model = "llama3.1:8b";
ClawLLM llm(cfg);
```
### Step 5: Rebuild and Test

```bash
cargo build --release  # If using the Rust bindings
claw_test --llm --iterations 100
```
## 4. Verification
Post-fix validation:

- Unit test:
  ```bash
  # Run Claw test suite
  claw_test llm_infer --model gpt-4o-mini --prompt "Test" --expect-success
  ```
- Load test:
  ```bash
  for i in {1..50}; do
    curl -X POST http://localhost:8080/infer -d '{"prompt":"Test"}' | jq .status
  done | grep -c "success"  # Expect 50 (100%)
  ```
- Monitor logs:
  ```bash
  tail -f claw.log | grep "clw-llm-failure"  # Should match nothing
  ```
- Metrics: poll the Prometheus `/metrics` endpoint for `claw_llm_latency_seconds` (<5 s p95) and `claw_llm_failure_rate` (0%).

Success: 100 inferences without errors, average latency <3 s.
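The p95 threshold can also be spot-checked offline against logged per-call latencies. The metric names above are part of this guide's assumed setup, but the percentile math is standard; here is a minimal nearest-rank sketch:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: sort the samples and take the value at
// ceil(p/100 * n), 1-indexed. Adequate for spot-checking log data.
double percentile(std::vector<double> samples, double p) {
    std::sort(samples.begin(), samples.end());
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(p / 100.0 * samples.size()));
    if (rank == 0) rank = 1;
    return samples[rank - 1];
}
```

Feed it the per-request latencies from the 100-iteration run and assert the result stays under 5.0 seconds.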
## 5. Common Pitfalls
- Proxy ignored: OpenClaw ignores `HTTP_PROXY` by default. Set `CLAW_PROXY_URL=http://proxy:8080`.
- Model name case: "GPT-4o-mini" vs "gpt-4o-mini"; use lowercase.
- Timeout too low: the default 5 s fails on cold starts. Always set 30 s or more.
- Threading race: multi-threaded apps exhaust connections. Set `cfg.max_conns = 50`.
- Docker networking: use `--network host` or expose ports properly.
- API key rotation: hardcoded keys expire. Use env vars plus Vault.
- ⚠️ Unverified: custom providers such as vLLM may need `accept: application/json` header hacks.
| Pitfall | Symptom | Fix |
|---|---|---|
| Rate limit (429) | Frequent early retries | `cfg.rate_limit_token = 1000` |
| JSON parse fail | "Invalid response" | Update ClawLLM to v2.2.1+ |
| DNS timeout | Connection refused | `cfg.dns_timeout_ms = 5000` |
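On the 429 row: `cfg.rate_limit_token` is this guide's assumed config knob, but the underlying mechanism is a token bucket, which you can also enforce client-side so requests never reach the provider's limiter. A minimal single-threaded sketch (the class and its parameters are illustrative):

```cpp
#include <algorithm>

// Token bucket: `capacity` tokens, refilled continuously at `rate_per_s`.
// Each request consumes one token; an empty bucket tells the caller to back
// off locally instead of triggering a 429 from the provider.
class TokenBucket {
public:
    TokenBucket(double capacity, double rate_per_s)
        : capacity_(capacity), tokens_(capacity), rate_(rate_per_s) {}

    // `now_s` is a monotonic timestamp supplied by the caller, which keeps
    // the refill logic deterministic and testable.
    bool try_acquire(double now_s) {
        tokens_ = std::min(capacity_, tokens_ + (now_s - last_s_) * rate_);
        last_s_ = now_s;
        if (tokens_ >= 1.0) { tokens_ -= 1.0; return true; }
        return false;
    }

private:
    double capacity_;
    double tokens_;
    double rate_;
    double last_s_ = 0.0;
};
```

Gate each `infer()` call on `try_acquire()` and sleep briefly when it returns `false`.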
## 6. Related Errors
- `clw-net-timeout`: pure network issue; fix with a higher `socket_timeout`.
- `clw-auth-invalid`: 401/403; rotate the API key.
- `clw-model-unavailable`: 404; verify the model list via `/v1/models`.
- `clw-payload-invalid`: malformed JSON; validate the request schema.
- `clw-resource-exhausted`: local OOM; increase RAM or reduce batch size.

Cross-reference: roughly 70% of `clw-llm-failure` occurrences trace back to `clw-net-timeout`.