Files
higress/.claude/skills/higress-openclaw-integration/references/TROUBLESHOOTING.md

326 lines
5.9 KiB
Markdown

# Higress AI Gateway - Troubleshooting
Common issues and solutions for Higress AI Gateway deployment and operation.
## Container Issues
### Container fails to start
**Check Docker is running:**
```bash
docker info
```
**Check port availability:**
```bash
netstat -tlnp | grep 8080
```
**View container logs:**
```bash
docker logs higress-ai-gateway
```
### Gateway not responding
**Check container status:**
```bash
docker ps -a
```
**Verify port mapping:**
```bash
docker port higress-ai-gateway
```
**Test locally:**
```bash
curl http://localhost:8080/v1/models
```
## File System Issues
### "too many open files" error from API server
**Symptom:**
```
panic: unable to create REST storage for a resource due to too many open files, will die
```
or
```
command failed err="failed to create shared file watcher: too many open files"
```
**Root Cause:**
The system's `fs.inotify.max_user_instances` limit is too low. This commonly occurs on systems with many Docker containers, as each container can consume inotify instances.
**Check current limit:**
```bash
cat /proc/sys/fs/inotify/max_user_instances
```
Default is often 128, which is insufficient when running multiple containers.
**Solution:**
Increase the inotify instance limit to 8192:
```bash
# Temporarily (until next reboot)
sudo sysctl -w fs.inotify.max_user_instances=8192
# Permanently (survives reboots)
echo "fs.inotify.max_user_instances = 8192" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
**Verify:**
```bash
cat /proc/sys/fs/inotify/max_user_instances
# Should output: 8192
```
**Restart the container:**
```bash
docker restart higress-ai-gateway
```
**Additional inotify tunables** (if still experiencing issues):
```bash
# Increase max watches per user
sudo sysctl -w fs.inotify.max_user_watches=524288
# Increase max queued events
sudo sysctl -w fs.inotify.max_queued_events=32768
```
To make these permanent as well:
```bash
echo "fs.inotify.max_user_watches = 524288" | sudo tee -a /etc/sysctl.conf
echo "fs.inotify.max_queued_events = 32768" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
## Plugin Issues
### Plugin not recognized
**Verify plugin installation:**
For Clawdbot:
```bash
ls -la ~/.clawdbot/extensions/higress-ai-gateway
```
For OpenClaw:
```bash
ls -la ~/.openclaw/extensions/higress-ai-gateway
```
**Check package.json:**
Ensure `package.json` contains the correct extension field:
- Clawdbot: `"clawdbot.extensions"`
- OpenClaw: `"openclaw.extensions"`
**Restart the runtime:**
```bash
# Restart Clawdbot gateway
clawdbot gateway restart
# Or OpenClaw gateway
openclaw gateway restart
```
## Routing Issues
### Auto-routing not working
**Confirm model is in list:**
```bash
# Check if higress/auto is available
clawdbot models list | grep "higress/auto"
```
**Check routing rules exist:**
```bash
./get-ai-gateway.sh route list
```
**Verify default model is configured:**
```bash
./get-ai-gateway.sh config list
```
**Check gateway logs:**
```bash
docker logs higress-ai-gateway | grep -i routing
```
**View access logs:**
```bash
tail -f ./higress/logs/access.log
```
## Configuration Issues
### Timezone detection fails
**Manually check timezone:**
```bash
timedatectl show --property=Timezone --value
```
**Or check timezone file:**
```bash
cat /etc/timezone
```
**Fallback behavior:**
- If detection fails, defaults to Hangzhou mirror
- Manual override: Set `IMAGE_REPO` environment variable
**Manual repository selection:**
```bash
# For China/Asia
IMAGE_REPO="higress-registry.cn-hangzhou.cr.aliyuncs.com/higress/all-in-one"
# For Southeast Asia
IMAGE_REPO="higress-registry.ap-southeast-7.cr.aliyuncs.com/higress/all-in-one"
# For North America
IMAGE_REPO="higress-registry.us-west-1.cr.aliyuncs.com/higress/all-in-one"
# Use in deployment
IMAGE_REPO="$IMAGE_REPO" ./get-ai-gateway.sh start --non-interactive ...
```
## Performance Issues
### Slow image downloads
**Check selected repository:**
```bash
echo $IMAGE_REPO
```
**Manually select closest mirror:**
See [Configuration Issues → Timezone detection fails](#timezone-detection-fails) for manual repository selection.
### High memory usage
**Check container stats:**
```bash
docker stats higress-ai-gateway
```
**View resource limits:**
```bash
docker inspect higress-ai-gateway | grep -A 10 "HostConfig"
```
**Set memory limits:**
```bash
# Stop container
./get-ai-gateway.sh stop
# Manually restart with limits
docker run -d \
--name higress-ai-gateway \
--memory="4g" \
--memory-swap="4g" \
...
```
## Log Analysis
### Access logs location
```bash
# Default location
./higress/logs/access.log
# View real-time logs
tail -f ./higress/logs/access.log
```
### Container logs
```bash
# View all logs
docker logs higress-ai-gateway
# Follow logs
docker logs -f higress-ai-gateway
# Last 100 lines
docker logs --tail 100 higress-ai-gateway
# With timestamps
docker logs -t higress-ai-gateway
```
## Network Issues
### Cannot connect to gateway
**Verify container is running:**
```bash
docker ps | grep higress-ai-gateway
```
**Check port bindings:**
```bash
docker port higress-ai-gateway
```
**Test from inside container:**
```bash
docker exec higress-ai-gateway curl localhost:8080/v1/models
```
**Check firewall rules:**
```bash
# Check if port is accessible
sudo ufw status | grep 8080
# Allow port (if needed)
sudo ufw allow 8080/tcp
```
### DNS resolution issues
**Test from container:**
```bash
docker exec higress-ai-gateway ping -c 3 api.openai.com
```
**Check DNS settings:**
```bash
docker exec higress-ai-gateway cat /etc/resolv.conf
```
## Getting Help
If you're still experiencing issues:
1. **Collect logs:**
```bash
docker logs higress-ai-gateway > gateway.log 2>&1
cat ./higress/logs/access.log > access.log
```
2. **Check system info:**
```bash
docker version
docker info
uname -a
cat /proc/sys/fs/inotify/max_user_instances
```
3. **Report issue:**
- Repository: https://github.com/higress-group/higress-standalone
- Include: logs, system info, deployment command used