[feat] load balancing across different clusters and endpoints based on metrics (#3063)

2026-06-09 12:47:28 +08:00 · 2025-11-25 10:32:34 +08:00
parent 7a504fd67d
commit 42334f21df
12 changed files with 764 additions and 126 deletions
--- a/plugins/wasm-go/extensions/ai-load-balancer/README_EN.md
+++ b/plugins/wasm-go/extensions/ai-load-balancer/README_EN.md
@@ -15,14 +15,19 @@ The configuration is:

 | Name                | Type         | Required          | default       | description                                 |
 |--------------------|-----------------|------------------|-------------|-------------------------------------|
-| `lb_policy`      | string          | required              |             | load balance type    |
+| `lb_type`        | string          | optional              | endpoint    | load balance type, `endpoint` or `cluster` |
+| `lb_policy`      | string          | required              |             | load balance policy type    |
 | `lb_config`      | object          | required              |             | configuration for the current load balance type    |

-Current supported load balance policies are:
+When `lb_type = endpoint`, the currently supported load balance policies are:

 - `global_least_request`: global least request based on redis
 - `prefix_cache`: Select the backend node based on the prompt prefix match. If the node cannot be matched by prefix matching, the service node is selected based on the global minimum number of requests.
- `least_busy`: implementation for [gateway-api-inference-extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/README.md)
+- `endpoint_metrics`: Load balancing based on metrics exposed by the LLM service
+
+When `lb_type = cluster`, the currently supported load balance policies are:
+- `cluster_metrics`: Load balancing based on cluster metrics
+

 # Global Least Request
 ## Introduction
@@ -60,6 +65,7 @@ sequenceDiagram
 ## Configuration Example

 ```yaml
+lb_type: endpoint
 lb_policy: global_least_request
 lb_config:
  serviceFQDN: redis.static
@@ -118,11 +124,12 @@ Then subsequent requests with the same prefix will also be routed to pod 1:
 | `password`         | string          | optional              | ``          | redis password                  |
 | `timeout`          | int             | optional              | 3000ms      | redis request timeout           |
 | `database`         | int             | optional              | 0           | redis database number           |
-| `redisKeyTTL`      | int             | optional              | 1800ms      | prompt prefix key's ttl         |
+| `redisKeyTTL`      | int             | optional              | 1800s      | prompt prefix key's ttl         |

 ## Configuration Example

 ```yaml
+lb_type: endpoint
 lb_policy: prefix_cache
 lb_config:
  serviceFQDN: redis.static
@@ -164,14 +171,71 @@ sequenceDiagram

 | Name                | Type         | Required          | default       | description                                 |
 |--------------------|-----------------|------------------|-------------|-------------------------------------|
-| `criticalModels`      | []string          | required              |             | critical model names    |
+| `metric_policy`      | string | required | | How the metrics exposed by the LLM service are used for load balancing; one of `default`, `least`, `most` |
+| `target_metric`      | string | optional | | Name of the metric to use. Only takes effect when `metric_policy` is `least` or `most` |
+| `rate_limit`      | string | optional | 1 | Maximum fraction of requests a single node may receive, in the range 0~1 |
+
+## Configuration Example
+
+Use the algorithm of [gateway-api-inference-extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/README.md):
+
+```yaml
+lb_type: endpoint
+lb_policy: metrics_based
+lb_config:
+  metric_policy: default
+  rate_limit: 0.6
+```
+
+Load balancing based on the current number of queued requests: 
+
+```yaml
+lb_type: endpoint
+lb_policy: metrics_based
+lb_config:
+  metric_policy: least
+  target_metric: vllm:num_requests_waiting
+  rate_limit: 0.6
+```
+
+Load balancing based on the number of requests currently being processed by the GPU:
+
+```yaml
+lb_type: endpoint
+lb_policy: metrics_based
+lb_config:
+  metric_policy: least
+  target_metric: vllm:num_requests_running
+  rate_limit: 0.6
+```
+
+# Cross-service load balancing
+
+## Configuration
+
+| Name                | Type         | Required          | default       | description                                 |
+|--------------------|-----------------|------------------|-------------|-------------------------------------|
+| `mode`      | string | required | | How cluster metrics are used; one of `LeastBusy`, `LeastTotalLatency`, `LeastFirstTokenLatency` |
+| `service_list`      | []string | required | | List of services for the current route |
+| `rate_limit`      | string | optional | 1 | Maximum fraction of requests a single node may receive, in the range 0~1 |
+| `cluster_header` | string | optional | `x-envoy-target-cluster` | Header whose value determines which backend service the request is routed to |
+| `queue_size`      | int | optional | 100 | Number of most recent requests over which the metrics are calculated |
+
+The meanings of the values for `mode` are as follows:
+
+- `LeastBusy`: Routes to the service with the fewest concurrent requests.
+- `LeastTotalLatency`: Routes to the service with the lowest total response time (RT).
+- `LeastFirstTokenLatency`: Routes to the service with the lowest time to first token.

 ## Configuration Example

 ```yaml
-lb_policy: least_busy
+lb_type: cluster
+lb_policy: cluster_metrics
 lb_config:
-  criticalModels:
-  - meta-llama/Llama-2-7b-hf
-  - sql-lora
-```
+  mode: LeastTotalLatency
+  rate_limit: 0.6
+  service_list:
+  - outbound|80||test-1.dns
+  - outbound|80||test-2.static
+```