diff --git a/.claude/skills/higress-clawdbot-integration/SKILL.md b/.claude/skills/higress-clawdbot-integration/SKILL.md
index e44a6b43f..b03667e38 100644
--- a/.claude/skills/higress-clawdbot-integration/SKILL.md
+++ b/.claude/skills/higress-clawdbot-integration/SKILL.md
@@ -420,29 +420,12 @@ Selected plugin registry: higress-registry.us-west-1.cr.aliyuncs.com
 
 ## Troubleshooting
 
-### Container fails to start
-- Check Docker is running: `docker info`
-- Check port availability: `netstat -tlnp | grep 8080`
-- View container logs: `docker logs higress-ai-gateway`
+For detailed troubleshooting guides, see [TROUBLESHOOTING.md](references/TROUBLESHOOTING.md).
 
-### Gateway not responding
-- Check container status: `docker ps -a`
-- Verify port mapping: `docker port higress-ai-gateway`
-- Test locally: `curl http://localhost:8080/v1/models`
-
-### Plugin not recognized
-- Verify plugin is installed at `~/.clawdbot/extensions/higress-ai-gateway` or `~/.openclaw/extensions/higress-ai-gateway`
-- Check `package.json` contains correct extension field (`clawdbot.extensions` or `openclaw.extensions`)
-- Restart Clawdbot/OpenClaw after installation
-
-### Auto-routing not working
-- Confirm `higress/auto` is in your model list
-- Check routing rules exist: `./get-ai-gateway.sh route list`
-- Verify default model is configured
-- Check gateway logs for routing decisions
-
-### Timezone detection fails
-- Manually check timezone: `timedatectl show --property=Timezone --value`
-- Or check `/etc/timezone` file
-- Fallback to default Hangzhou mirror if detection fails
-- Consider manually setting `IMAGE_REPO` environment variable if auto-detection is incorrect
+Common issues:
+- **Container fails to start**: Check Docker status, port availability, and container logs
+- **"too many open files" error**: Increase `fs.inotify.max_user_instances` to 8192
+- **Gateway not responding**: Verify container status and port mapping
+- **Plugin not recognized**: Check installation path and restart runtime
+- **Auto-routing not working**: Verify model list and routing rules
+- **Timezone detection fails**: Manually set `IMAGE_REPO` environment variable
diff --git a/.claude/skills/higress-clawdbot-integration/references/TROUBLESHOOTING.md b/.claude/skills/higress-clawdbot-integration/references/TROUBLESHOOTING.md
new file mode 100644
index 000000000..6882eccf5
--- /dev/null
+++ b/.claude/skills/higress-clawdbot-integration/references/TROUBLESHOOTING.md
@@ -0,0 +1,325 @@
+# Higress AI Gateway - Troubleshooting
+
+Common issues and solutions for Higress AI Gateway deployment and operation.
+
+## Container Issues
+
+### Container fails to start
+
+**Check Docker is running:**
+```bash
+docker info
+```
+
+**Check port availability:**
+```bash
+netstat -tlnp | grep 8080
+```
+
+**View container logs:**
+```bash
+docker logs higress-ai-gateway
+```
+
+### Gateway not responding
+
+**Check container status:**
+```bash
+docker ps -a
+```
+
+**Verify port mapping:**
+```bash
+docker port higress-ai-gateway
+```
+
+**Test locally:**
+```bash
+curl http://localhost:8080/v1/models
+```
+
+## File System Issues
+
+### "too many open files" error from API server
+
+**Symptom:**
+```
+panic: unable to create REST storage for a resource due to too many open files, will die
+```
+or
+```
+command failed err="failed to create shared file watcher: too many open files"
+```
+
+**Root Cause:**
+
+The system's `fs.inotify.max_user_instances` limit is too low. This commonly occurs on systems with many Docker containers, as each container can consume inotify instances.
+
+**Check current limit:**
+```bash
+cat /proc/sys/fs/inotify/max_user_instances
+```
+
+Default is often 128, which is insufficient when running multiple containers.
+
+**Solution:**
+
+Increase the inotify instance limit to 8192:
+
+```bash
+# Temporarily (until next reboot)
+sudo sysctl -w fs.inotify.max_user_instances=8192
+
+# Permanently (survives reboots)
+echo "fs.inotify.max_user_instances = 8192" | sudo tee -a /etc/sysctl.conf
+sudo sysctl -p
+```
+
+**Verify:**
+```bash
+cat /proc/sys/fs/inotify/max_user_instances
+# Should output: 8192
+```
+
+**Restart the container:**
+```bash
+docker restart higress-ai-gateway
+```
+
+**Additional inotify tunables** (if still experiencing issues):
+```bash
+# Increase max watches per user
+sudo sysctl -w fs.inotify.max_user_watches=524288
+
+# Increase max queued events
+sudo sysctl -w fs.inotify.max_queued_events=32768
+```
+
+To make these permanent as well:
+```bash
+echo "fs.inotify.max_user_watches = 524288" | sudo tee -a /etc/sysctl.conf
+echo "fs.inotify.max_queued_events = 32768" | sudo tee -a /etc/sysctl.conf
+sudo sysctl -p
+```
+
+## Plugin Issues
+
+### Plugin not recognized
+
+**Verify plugin installation:**
+
+For Clawdbot:
+```bash
+ls -la ~/.clawdbot/extensions/higress-ai-gateway
+```
+
+For OpenClaw:
+```bash
+ls -la ~/.openclaw/extensions/higress-ai-gateway
+```
+
+**Check package.json:**
+
+Ensure `package.json` contains the correct extension field:
+- Clawdbot: `"clawdbot.extensions"`
+- OpenClaw: `"openclaw.extensions"`
+
+**Restart the runtime:**
+```bash
+# Restart Clawdbot gateway
+clawdbot gateway restart
+
+# Or OpenClaw gateway
+openclaw gateway restart
+```
+
+## Routing Issues
+
+### Auto-routing not working
+
+**Confirm model is in list:**
+```bash
+# Check if higress/auto is available
+clawdbot models list | grep "higress/auto"
+```
+
+**Check routing rules exist:**
+```bash
+./get-ai-gateway.sh route list
+```
+
+**Verify default model is configured:**
+```bash
+./get-ai-gateway.sh config list
+```
+
+**Check gateway logs:**
+```bash
+docker logs higress-ai-gateway | grep -i routing
+```
+
+**View access logs:**
+```bash
+tail -f ./higress/logs/access.log
+```
+
+## Configuration Issues
+
+### Timezone detection fails
+
+**Manually check timezone:**
+```bash
+timedatectl show --property=Timezone --value
+```
+
+**Or check timezone file:**
+```bash
+cat /etc/timezone
+```
+
+**Fallback behavior:**
+- If detection fails, defaults to Hangzhou mirror
+- Manual override: Set `IMAGE_REPO` environment variable
+
+**Manual repository selection:**
+```bash
+# For China/Asia
+IMAGE_REPO="higress-registry.cn-hangzhou.cr.aliyuncs.com/higress/all-in-one"
+
+# For Southeast Asia
+IMAGE_REPO="higress-registry.ap-southeast-7.cr.aliyuncs.com/higress/all-in-one"
+
+# For North America
+IMAGE_REPO="higress-registry.us-west-1.cr.aliyuncs.com/higress/all-in-one"
+
+# Use in deployment
+IMAGE_REPO="$IMAGE_REPO" ./get-ai-gateway.sh start --non-interactive ...
+```
+
+## Performance Issues
+
+### Slow image downloads
+
+**Check selected repository:**
+```bash
+echo $IMAGE_REPO
+```
+
+**Manually select closest mirror:**
+
+See [Configuration Issues → Timezone detection fails](#timezone-detection-fails) for manual repository selection.
+
+### High memory usage
+
+**Check container stats:**
+```bash
+docker stats higress-ai-gateway
+```
+
+**View resource limits:**
+```bash
+docker inspect higress-ai-gateway | grep -A 10 "HostConfig"
+```
+
+**Set memory limits:**
+```bash
+# Stop container
+./get-ai-gateway.sh stop
+
+# Manually restart with limits
+docker run -d \
+  --name higress-ai-gateway \
+  --memory="4g" \
+  --memory-swap="4g" \
+  ...
+```
+
+## Log Analysis
+
+### Access logs location
+
+```bash
+# Default location
+./higress/logs/access.log
+
+# View real-time logs
+tail -f ./higress/logs/access.log
+```
+
+### Container logs
+
+```bash
+# View all logs
+docker logs higress-ai-gateway
+
+# Follow logs
+docker logs -f higress-ai-gateway
+
+# Last 100 lines
+docker logs --tail 100 higress-ai-gateway
+
+# With timestamps
+docker logs -t higress-ai-gateway
+```
+
+## Network Issues
+
+### Cannot connect to gateway
+
+**Verify container is running:**
+```bash
+docker ps | grep higress-ai-gateway
+```
+
+**Check port bindings:**
+```bash
+docker port higress-ai-gateway
+```
+
+**Test from inside container:**
+```bash
+docker exec higress-ai-gateway curl localhost:8080/v1/models
+```
+
+**Check firewall rules:**
+```bash
+# Check if port is accessible
+sudo ufw status | grep 8080
+
+# Allow port (if needed)
+sudo ufw allow 8080/tcp
+```
+
+### DNS resolution issues
+
+**Test from container:**
+```bash
+docker exec higress-ai-gateway ping -c 3 api.openai.com
+```
+
+**Check DNS settings:**
+```bash
+docker exec higress-ai-gateway cat /etc/resolv.conf
+```
+
+## Getting Help
+
+If you're still experiencing issues:
+
+1. **Collect logs:**
+   ```bash
+   docker logs higress-ai-gateway > gateway.log 2>&1
+   cat ./higress/logs/access.log > access.log
+   ```
+
+2. **Check system info:**
+   ```bash
+   docker version
+   docker info
+   uname -a
+   cat /proc/sys/fs/inotify/max_user_instances
+   ```
+
+3. **Report issue:**
+   - Repository: https://github.com/higress-group/higress-standalone
+   - Include: logs, system info, deployment command used