---
title: AI Load Balance
keywords: [higress, llm, load balance]
description: LLM-oriented load balance policies
---
# Introduction
**Attention**:
- Higress version must be >= v2.1.5

This plugin provides LLM-oriented load balancing in a hot-swappable manner. When the plugin is disabled, load balancing falls back to the service's own policy (round robin, least request, random, consistent hash, etc.).
The configuration is:
| Name | Type | Required | default | description |
|--------------------|-----------------|------------------|-------------|-------------------------------------|
| `lb_type` | string | optional | endpoint | load balancing level, `endpoint` or `cluster` |
| `lb_policy` | string | required | | load balancing policy name |
| `lb_config` | object | required | | configuration for the selected policy |
When `lb_type` is `endpoint`, the currently supported load balancing policies are:
- `global_least_request`: global least request, backed by Redis
- `prefix_cache`: selects the backend node by prompt prefix match; if no node matches the prefix, falls back to global least request
- `endpoint_metrics`: load balancing based on metrics exposed by the LLM service
When `lb_type` is `cluster`, the currently supported load balancing policies are:
- `cluster_metrics`: load balancing based on per-service (cluster) metrics
# Global Least Request
## Introduction
```mermaid
sequenceDiagram
participant C as Client
participant H as Higress
participant R as Redis
participant H1 as Host1
participant H2 as Host2
C ->> H: Send request
H ->> R: Get per-host in-flight request counts
R ->> H: Return result
H ->> R: Select the host with the fewest in-flight requests, host rq count +1
R ->> H: Return result
H ->> H1: Forward the request to the selected host, bypassing the service's original load balancing policy
H1 ->> H: Return result
H ->> R: host rq count -1
H ->> C: Return response
```
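The selection step in the diagram boils down to reading per-host in-flight counters, picking the minimum, incrementing the winner before forwarding, and decrementing it once the response arrives. The Go sketch below illustrates only that counter logic; the in-memory map stands in for Redis, and the `pickLeastLoaded`/`release` helpers are illustrative, not the plugin's actual code.
```go
package main

import "fmt"

// pickLeastLoaded returns the host with the fewest in-flight requests and
// increments its counter, mirroring the "select + host rq count +1" step above.
// counters stands in for the per-host counters the plugin keeps in Redis.
func pickLeastLoaded(hosts []string, counters map[string]int) string {
	best := hosts[0]
	for _, h := range hosts[1:] {
		if counters[h] < counters[best] {
			best = h
		}
	}
	counters[best]++ // host rq count +1 before forwarding
	return best
}

// release decrements the counter once the upstream response is received,
// mirroring the "host rq count -1" step above.
func release(host string, counters map[string]int) {
	counters[host]--
}

func main() {
	hosts := []string{"host1:8000", "host2:8000"}
	counters := map[string]int{"host1:8000": 3, "host2:8000": 1}

	h := pickLeastLoaded(hosts, counters) // host2 has fewer in-flight requests
	fmt.Println("forward to", h, counters)
	release(h, counters)
	fmt.Println("after response", counters)
}
```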
## Configuration
| Name | Type | required | default | description |
|--------------------|-----------------|------------------|-------------|-------------------------------------|
| `serviceFQDN` | string | required | | redis FQDN, e.g. `redis.dns` |
| `servicePort` | int | required | | redis port |
| `username` | string | required | | redis username |
| `password` | string | optional | `` | redis password |
| `timeout` | int | optional | 3000 | redis request timeout, in milliseconds |
| `database` | int | optional | 0 | redis database number |
## Configuration Example
```yaml
lb_type: endpoint
lb_policy: global_least_request
lb_config:
  serviceFQDN: redis.static
  servicePort: 6379
  username: default
  password: '123456'
```
# Prefix Cache
## Introduction
Selects the pod by prompt prefix match so that the KV cache can be reused. If no pod matches the prefix, the pod is selected by global least request.
For example, the following request is routed to pod 1:
```json
{
  "model": "qwen-turbo",
  "messages": [
    {
      "role": "user",
      "content": "hi"
    }
  ]
}
```
Then subsequent requests with the same prefix will also be routed to pod 1:
```json
{
  "model": "qwen-turbo",
  "messages": [
    {
      "role": "user",
      "content": "hi"
    },
    {
      "role": "assistant",
      "content": "Hi! How can I assist you today? 😊"
    },
    {
      "role": "user",
      "content": "write a short story of about 100 words"
    }
  ]
}
```
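Conceptually, the policy remembers which pod served a given message prefix and routes any later request whose messages start with that prefix to the same pod; only when no prefix matches does it fall back to global least request. The Go sketch below shows one way such a prefix key could be derived (hashing the serialized leading messages); the key format and the `prefixKey` helper are illustrative assumptions, not the plugin's actual Redis schema.
```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

type message struct {
	Role    string
	Content string
}

// prefixKey derives a cache key from the first n messages of a conversation.
// The pod assignment is looked up under keys like this (the exact key format
// is an assumption for illustration, not the plugin's actual schema).
func prefixKey(msgs []message, n int) string {
	var b strings.Builder
	for _, m := range msgs[:n] {
		b.WriteString(m.Role + ":" + m.Content + "\n")
	}
	sum := sha256.Sum256([]byte(b.String()))
	return "prefix:" + hex.EncodeToString(sum[:])
}

func main() {
	first := []message{{"user", "hi"}}
	followUp := []message{
		{"user", "hi"},
		{"assistant", "Hi! How can I assist you today? 😊"},
		{"user", "write a short story of about 100 words"},
	}

	// The follow-up request shares the 1-message prefix of the first request,
	// so both map to the same key and therefore to the same pod.
	fmt.Println(prefixKey(first, 1) == prefixKey(followUp, 1)) // true
}
```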
## Configuration
| Name | Type | required | default | description |
|--------------------|-----------------|-----------------------|-------------|---------------------------------|
| `serviceFQDN` | string | required | | redis FQDN, e.g. `redis.dns` |
| `servicePort` | int | required | | redis port |
| `username` | string | required | | redis username |
| `password` | string | optional | `` | redis password |
| `timeout` | int | optional | 3000 | redis request timeout, in milliseconds |
| `database` | int | optional | 0 | redis database number |
| `redisKeyTTL` | int | optional | 1800 | TTL of the prompt prefix key, in seconds |
## Configuration Example
```yaml
lb_type: endpoint
lb_policy: prefix_cache
lb_config:
  serviceFQDN: redis.static
  servicePort: 6379
  username: default
  password: '123456'
```
# Least Busy
## Introduction
A Wasm implementation of the scheduling algorithm from [gateway-api-inference-extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/README.md).
```mermaid
sequenceDiagram
participant C as Client
participant H as Higress
participant H1 as Host1
participant H2 as Host2
loop fetch metrics periodically
H ->> H1: /metrics
H1 ->> H: vllm metrics
H ->> H2: /metrics
H2 ->> H: vllm metrics
end
C ->> H: request
H ->> H1: select pod according to vLLM metrics, bypassing the service's original load balancing policy
H1 ->> H: response
H ->> C: response
```
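With `metric_policy: least` or `most`, the gateway periodically scrapes each endpoint's `/metrics` and then forwards to the endpoint whose `target_metric` value is smallest (or largest). A minimal Go sketch of that selection step is below, assuming the metrics have already been scraped into a map; the `selectByMetric` helper is illustrative and the scraping itself is omitted.
```go
package main

import "fmt"

// selectByMetric picks an endpoint by a single scraped metric value.
// Policy "least" prefers the smallest value (e.g. vllm:num_requests_waiting),
// "most" the largest. Scraping /metrics is out of scope for this sketch.
func selectByMetric(metrics map[string]float64, policy string) string {
	var best string
	for ep, v := range metrics {
		if best == "" {
			best = ep
			continue
		}
		if (policy == "least" && v < metrics[best]) ||
			(policy == "most" && v > metrics[best]) {
			best = ep
		}
	}
	return best
}

func main() {
	// vllm:num_requests_waiting sampled from each endpoint's /metrics.
	waiting := map[string]float64{
		"10.0.0.1:8000": 4,
		"10.0.0.2:8000": 1,
	}
	fmt.Println(selectByMetric(waiting, "least")) // 10.0.0.2:8000
}
```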
<!-- flowchart for pod selection:
![](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/docs/scheduler-flowchart.png) -->
## Configuration
| Name | Type | Required | default | description |
|--------------------|-----------------|------------------|-------------|-------------------------------------|
| `metric_policy` | string | required | | how to use the metrics exposed by the LLM service for load balancing; one of `default`, `least`, `most` |
| `target_metric` | string | optional | | the metric name to use; only takes effect when `metric_policy` is `least` or `most` |
| `rate_limit` | string | optional | 1 | the maximum fraction of requests a single node can receive, in the range 0~1 (see the sketch below) |
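As a rough illustration of `rate_limit` (the accounting below is an assumption, not taken from the plugin source): with `rate_limit: 0.6`, an endpoint that has already received more than 60% of the recently observed requests would be skipped in favor of another endpoint.
```go
package main

import "fmt"

// overLimit reports whether an endpoint already holds more than rateLimit of
// the recently observed requests. How the plugin actually accounts for this
// is an assumption; the sketch only illustrates the 0~1 fraction.
func overLimit(perEndpoint, total int, rateLimit float64) bool {
	if total == 0 {
		return false
	}
	return float64(perEndpoint)/float64(total) > rateLimit
}

func main() {
	// With rate_limit: 0.6, an endpoint that served 7 of the last 10 requests
	// (70%) would be skipped in favor of another endpoint.
	fmt.Println(overLimit(7, 10, 0.6)) // true
}
```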
## Configuration Example
Use the algorithm of [gateway-api-inference-extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/README.md):
```yaml
lb_type: endpoint
lb_policy: metrics_based
lb_config:
  metric_policy: default
  rate_limit: 0.6
```
Load balancing based on the current number of queued requests:
```yaml
lb_type: endpoint
lb_policy: metrics_based
lb_config:
  metric_policy: least
  target_metric: vllm:num_requests_waiting
  rate_limit: 0.6
```
Load balancing based on the number of requests currently being processed by the GPU:
```yaml
lb_type: endpoint
lb_policy: metrics_based
lb_config:
  metric_policy: least
  target_metric: vllm:num_requests_running
  rate_limit: 0.6
```
# Cross-Service Load Balancing
## Configuration
| Name | Type | Required | default | description |
|--------------------|-----------------|------------------|-------------|-------------------------------------|
| `mode` | string | required | | how to use the cluster metrics; one of `LeastBusy`, `LeastTotalLatency`, `LeastFirstTokenLatency` |
| `service_list` | []string | required | | list of candidate services for the current route |
| `rate_limit` | string | optional | 1 | the maximum fraction of requests a single service can receive, in the range 0~1 |
| `cluster_header` | string | optional | `x-envoy-target-cluster` | the header whose value determines which backend service the request is routed to |
| `queue_size` | int | optional | 100 | metrics are computed over this many most recent requests |
The meanings of the values for `mode` are as follows:
- `LeastBusy`: Routes to the service with the fewest concurrent requests.
- `LeastTotalLatency`: Routes to the service with the lowest average response time (RT); see the sketch after this list.
- `LeastFirstTokenLatency`: Routes to the service with the lowest first-token (first-packet) RT.
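A minimal Go sketch of the `LeastTotalLatency` case, assuming each candidate service keeps a sliding window of the response times of its last `queue_size` requests; the `latencyWindow` type and `leastTotalLatency` helper are illustrative, not the plugin's actual bookkeeping.
```go
package main

import (
	"fmt"
	"time"
)

// latencyWindow keeps the response times of the most recent requests for one
// service, bounded by queue_size.
type latencyWindow struct {
	samples []time.Duration
	size    int
}

func (w *latencyWindow) add(d time.Duration) {
	w.samples = append(w.samples, d)
	if len(w.samples) > w.size {
		w.samples = w.samples[1:]
	}
}

func (w *latencyWindow) average() time.Duration {
	if len(w.samples) == 0 {
		return 0
	}
	var sum time.Duration
	for _, d := range w.samples {
		sum += d
	}
	return sum / time.Duration(len(w.samples))
}

// leastTotalLatency picks the service with the lowest average RT over its
// window, mirroring mode: LeastTotalLatency.
func leastTotalLatency(windows map[string]*latencyWindow) string {
	var best string
	for svc, w := range windows {
		if best == "" || w.average() < windows[best].average() {
			best = svc
		}
	}
	return best
}

func main() {
	windows := map[string]*latencyWindow{
		"outbound|80||test-1.dns":    {size: 100},
		"outbound|80||test-2.static": {size: 100},
	}
	windows["outbound|80||test-1.dns"].add(900 * time.Millisecond)
	windows["outbound|80||test-2.static"].add(400 * time.Millisecond)
	fmt.Println(leastTotalLatency(windows)) // outbound|80||test-2.static
}
```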
## Configuration Example
```yaml
lb_type: cluster
lb_policy: cluster_metrics
lb_config:
  mode: LeastTotalLatency
  rate_limit: 0.6
  service_list:
    - outbound|80||test-1.dns
    - outbound|80||test-2.static
```