| title | keywords | description |
|---|---|---|
| AI Load Balance | | LLM-oriented load balance policies |
Introduction
Attention:
- Higress version must be >= v2.1.5

This plugin provides LLM-oriented load balancing in a hot-swappable manner. If the plugin is disabled, load balancing falls back to the service's own policy (round robin, least request, random, consistent hash, etc.).
The configuration is:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| lb_type | string | optional | endpoint | load balance policy type, `endpoint` or `cluster` |
| lb_policy | string | required | - | load balance policy name |
| lb_config | object | required | - | configuration for the selected load balance policy |
When `lb_type` = `endpoint`, the currently supported load balance policies are:
- `global_least_request`: global least request based on Redis
- `prefix_cache`: selects the backend node by prompt prefix match; if no node matches the prefix, the node with the global least number of requests is selected
- `endpoint_metrics`: load balancing based on metrics exposed by the LLM service
When `lb_type` = `cluster`, the currently supported load balance policies are:
- `cluster_metrics`: load balancing based on cluster-level metrics
Global Least Request
Introduction
```mermaid
sequenceDiagram
    participant C as Client
    participant H as Higress
    participant R as Redis
    participant H1 as Host1
    participant H2 as Host2
    C ->> H: Send request
    H ->> R: Get per-host ongoing request counts
    R ->> H: Return result
    H ->> R: Select the host with the fewest ongoing requests, host rq count +1
    R ->> H: Return result
    H ->> H1: Bypass the service's original load balancing policy and forward the request to the selected host
    H1 ->> H: Return result
    H ->> R: host rq count -1
    H ->> C: Return response
```
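The flow above can be sketched in Go, assuming the per-host in-flight counters live in a single Redis hash; the key name `host_rq`, the helper functions, and the addresses are illustrative, not the plugin's actual implementation:

```go
// Minimal sketch: least-request selection backed by a Redis hash of
// per-host in-flight counters. All names here are illustrative.
package main

import (
	"context"
	"errors"
	"fmt"
	"strconv"

	"github.com/redis/go-redis/v9"
)

// pickHost selects the host with the fewest ongoing requests and
// increments its counter (host rq count +1).
func pickHost(ctx context.Context, rdb *redis.Client, hosts []string) (string, error) {
	if len(hosts) == 0 {
		return "", errors.New("no hosts")
	}
	counts, err := rdb.HGetAll(ctx, "host_rq").Result()
	if err != nil {
		return "", err
	}
	best, bestN := "", int(^uint(0)>>1)
	for _, h := range hosts {
		n, _ := strconv.Atoi(counts[h]) // a host with no counter yet counts as 0
		if n < bestN {
			best, bestN = h, n
		}
	}
	return best, rdb.HIncrBy(ctx, "host_rq", best, 1).Err()
}

// releaseHost decrements the counter once the response arrives
// (host rq count -1).
func releaseHost(ctx context.Context, rdb *redis.Client, host string) error {
	return rdb.HIncrBy(ctx, "host_rq", host, -1).Err()
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "redis.static:6379"})
	ctx := context.Background()
	host, err := pickHost(ctx, rdb, []string{"10.0.0.1:8000", "10.0.0.2:8000"})
	fmt.Println(host, err)
	if err == nil {
		releaseHost(ctx, rdb, host)
	}
}
```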
Configuration
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| serviceFQDN | string | required | - | Redis service FQDN, e.g. `redis.dns` |
| servicePort | int | required | - | Redis service port |
| username | string | required | - | Redis username |
| password | string | optional | - | Redis password |
| timeout | int | optional | 3000 | Redis request timeout (ms) |
| database | int | optional | 0 | Redis database number |
Configuration Example
```yaml
lb_type: endpoint
lb_policy: global_least_request
lb_config:
  serviceFQDN: redis.static
  servicePort: 6379
  username: default
  password: '123456'
```
Prefix Cache
Introduction
Selects a pod by prompt prefix match so that its KV cache can be reused. If no pod matches the prefix, the pod with the global least number of requests is selected.
For example, the following request is routed to pod 1:
```json
{
  "model": "qwen-turbo",
  "messages": [
    {
      "role": "user",
      "content": "hi"
    }
  ]
}
```
Then subsequent requests with the same prefix will also be routed to pod 1:
```json
{
  "model": "qwen-turbo",
  "messages": [
    {
      "role": "user",
      "content": "hi"
    },
    {
      "role": "assistant",
      "content": "Hi! How can I assist you today? 😊"
    },
    {
      "role": "user",
      "content": "write a short story of about 100 words"
    }
  ]
}
```
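A rough sketch of how such prefix matching can be realized, assuming prefix-to-pod mappings are stored in Redis under hashed keys and refreshed with `redisKeyTTL`; the key scheme, hashing choice, and helper names are assumptions for illustration:

```go
// Illustrative sketch: look up the longest already-seen message prefix
// in Redis to find the pod holding its KV cache.
package main

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// prefixKey hashes the first messages of a conversation into a Redis key.
func prefixKey(msgs []string) string {
	h := sha256.New()
	for _, m := range msgs {
		h.Write([]byte(m))
	}
	return "prefix:" + hex.EncodeToString(h.Sum(nil))
}

// selectByPrefix tries the longest message prefix first and returns the
// pod that served it before; ok=false means the caller should fall back
// to global least request.
func selectByPrefix(ctx context.Context, rdb *redis.Client, msgs []string, ttl time.Duration) (pod string, ok bool) {
	for i := len(msgs); i > 0; i-- {
		key := prefixKey(msgs[:i])
		if pod, err := rdb.Get(ctx, key).Result(); err == nil {
			rdb.Expire(ctx, key, ttl) // refresh redisKeyTTL on a hit
			return pod, true
		}
	}
	return "", false
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "redis.static:6379"})
	msgs := []string{"user:hi", "assistant:Hi! How can I assist you today? 😊"}
	pod, ok := selectByPrefix(context.Background(), rdb, msgs, 1800*time.Second)
	fmt.Println(pod, ok)
}
```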
Configuration
| Name | Type | required | default | description |
|---|---|---|---|---|
serviceFQDN |
string | required | redis FQDN, e.g. redis.dns |
|
servicePort |
int | required | redis port | |
username |
string | required | redis username | |
password |
string | optional | `` | redis password |
timeout |
int | optional | 3000ms | redis request timeout |
database |
int | optional | 0 | redis database number |
redisKeyTTL |
int | optional | 1800s | prompt prefix key's ttl |
Configuration Example
```yaml
lb_type: endpoint
lb_policy: prefix_cache
lb_config:
  serviceFQDN: redis.static
  servicePort: 6379
  username: default
  password: '123456'
```
Least Busy
Introduction
A WASM implementation of the endpoint-picking algorithm from gateway-api-inference-extension.
```mermaid
sequenceDiagram
    participant C as Client
    participant H as Higress
    participant H1 as Host1
    participant H2 as Host2
    loop Fetch metrics periodically
        H ->> H1: /metrics
        H1 ->> H: vllm metrics
        H ->> H2: /metrics
        H2 ->> H: vllm metrics
    end
    C ->> H: Send request
    H ->> H1: Select pod according to vllm metrics, bypassing the service's original load balance policy
    H1 ->> H: Return response
    H ->> C: Return response
```
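A hypothetical sketch of how `least`-style selection combined with `rate_limit` can work; `podStats`, the window accounting, and the sample values are assumptions for illustration, not the plugin's actual implementation:

```go
// Sketch: pick the pod with the smallest metric value, skipping pods
// that already exceed their allowed share of recent traffic.
package main

import "fmt"

type podStats struct {
	addr   string  // endpoint address
	metric float64 // latest scraped value, e.g. vllm:num_requests_waiting
	recent int     // requests sent to this pod within the recent window
}

// pickLeast returns the pod with the smallest metric value whose share
// of the last `window` requests stays below rateLimit (0~1); nil means
// every pod is capped and the caller should fall back.
func pickLeast(pods []podStats, window int, rateLimit float64) *podStats {
	var best *podStats
	for i := range pods {
		p := &pods[i]
		if float64(p.recent)/float64(window) >= rateLimit {
			continue // pod already received its maximum share of traffic
		}
		if best == nil || p.metric < best.metric {
			best = p
		}
	}
	return best
}

func main() {
	pods := []podStats{
		{addr: "10.0.0.1:8000", metric: 3, recent: 70},
		{addr: "10.0.0.2:8000", metric: 9, recent: 30},
	}
	// With rate_limit 0.6, pod 1 (70% of recent traffic) is capped, so
	// pod 2 is chosen despite its longer queue.
	fmt.Println(pickLeast(pods, 100, 0.6).addr)
}
```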
Configuration
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| metric_policy | string | required | - | how to use the metrics exposed by the LLM service; one of `default`, `least`, `most` |
| target_metric | string | optional | - | the metric name to use; only valid when `metric_policy` is `least` or `most` |
| rate_limit | string | optional | 1 | the maximum share of requests a single node can receive, 0~1 |
Configuration Example
Use the algorithm of gateway-api-inference-extension:
```yaml
lb_type: endpoint
lb_policy: metrics_based
lb_config:
  metric_policy: default
  rate_limit: 0.6
```
Load balancing based on the current number of queued requests:
```yaml
lb_type: endpoint
lb_policy: metrics_based
lb_config:
  metric_policy: least
  target_metric: vllm:num_requests_waiting
  rate_limit: 0.6
```
Load balancing based on the number of requests currently being processed by the GPU:
```yaml
lb_type: endpoint
lb_policy: metrics_based
lb_config:
  metric_policy: least
  target_metric: vllm:num_requests_running
  rate_limit: 0.6
```
Cross-Service Load Balancing
Configuration
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| mode | string | required | - | how to use cluster metrics; one of `LeastBusy`, `LeastTotalLatency`, `LeastFirstTokenLatency` |
| service_list | []string | required | - | service list of the current route |
| rate_limit | string | optional | 1 | the maximum share of requests a single service can receive, 0~1 |
| cluster_header | string | optional | x-envoy-target-cluster | the header whose value determines which backend service the request routes to |
| queue_size | int | optional | 100 | metrics are calculated over the most recent `queue_size` requests |
The meanings of the values for `mode` are as follows:
- `LeastBusy`: routes to the service with the fewest concurrent requests
- `LeastTotalLatency`: routes to the service with the lowest total response time (RT)
- `LeastFirstTokenLatency`: routes to the service with the lowest first-token latency
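The latency-based modes need a bounded history per service. Below is a hypothetical sketch, assuming a simple ring buffer over the most recent `queue_size` response times; the types and names are illustrative, not the plugin's code:

```go
// Sketch: per-service sliding window of response times, used to pick
// the service with the lowest mean RT (LeastTotalLatency).
package main

import "fmt"

type latencyWindow struct {
	samples []float64 // per-request RT in ms, newest overwrites oldest
	next    int
	filled  bool
}

func newLatencyWindow(queueSize int) *latencyWindow {
	return &latencyWindow{samples: make([]float64, queueSize)}
}

func (w *latencyWindow) add(rtMs float64) {
	w.samples[w.next] = rtMs
	w.next = (w.next + 1) % len(w.samples)
	if w.next == 0 {
		w.filled = true
	}
}

func (w *latencyWindow) mean() float64 {
	n := w.next
	if w.filled {
		n = len(w.samples)
	}
	if n == 0 {
		return 0 // no samples yet: treat the service as cold
	}
	sum := 0.0
	for _, v := range w.samples[:n] {
		sum += v
	}
	return sum / float64(n)
}

// leastTotalLatency picks the service with the lowest mean RT.
func leastTotalLatency(windows map[string]*latencyWindow) string {
	best, bestRT := "", 0.0
	for name, w := range windows {
		if rt := w.mean(); best == "" || rt < bestRT {
			best, bestRT = name, rt
		}
	}
	return best
}

func main() {
	windows := map[string]*latencyWindow{
		"outbound|80||test-1.dns":    newLatencyWindow(100),
		"outbound|80||test-2.static": newLatencyWindow(100),
	}
	windows["outbound|80||test-1.dns"].add(120)
	windows["outbound|80||test-2.static"].add(450)
	fmt.Println(leastTotalLatency(windows)) // outbound|80||test-1.dns
}
```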
Configuration Example
```yaml
lb_type: cluster
lb_policy: cluster_metrics
lb_config:
  mode: LeastTotalLatency
  rate_limit: 0.6
  service_list:
  - outbound|80||test-1.dns
  - outbound|80||test-2.static
```