| title | keywords | description |
|---|---|---|
| AI Cache | | AI Cache Plugin Configuration Reference |
## Function Description

LLM result caching plugin. The default configuration can be used directly for OpenAI-protocol result caching, and caching of both streaming and non-streaming responses is supported.
> **Tip**
>
> When a request carries the header `x-higress-skip-ai-cache: on`, it will not use cached content and will be forwarded directly to the backend service; the response to such a request will not be cached either.
## Runtime Properties
Plugin Execution Phase: Authentication Phase
Plugin Execution Priority: 10
## Configuration Description
The configuration is divided into three parts: the vector database (`vector`), the text embedding service (`embedding`), and the cache service (`cache`). It also provides fine-grained parameters for extracting content from LLM requests and responses.

This plugin supports both semantic caching based on a vector database and caching based on string matching. If both a vector database and a cache database are configured, the cache database is tried first, and the vector database is consulted on a cache miss.

Note: `vector` and `cache` cannot both be empty; otherwise this plugin cannot provide any caching service.
| Name | Type | Requirement | Default | Description |
|---|---|---|---|---|
| vector | object | optional | - | Vector storage service configuration, see Vector Database Service section below |
| embedding | object | optional | - | Text embedding service configuration, see Text Embedding Service section below |
| cache | object | optional | - | Cache service configuration, see Cache Service section below |
| cacheKeyStrategy | string | optional | "lastQuestion" | Strategy for generating cache key from historical questions. Options: "lastQuestion" (use last question), "allQuestions" (concatenate all questions), or "disabled" (disable caching) |
| enableSemanticCache | bool | optional | false | Whether to enable semantic caching. If disabled, string matching is used to find cache, requiring cache service configuration. Automatically enabled when a vector provider is configured |
Depending on whether semantic caching is needed, you can configure component combinations as follows:
- `cache`: enable string-matching cache only
- `vector (+ embedding)`: enable semantic caching. If the `vector` provider does not offer a text embedding service, you need to configure an `embedding` service separately
- `vector (+ embedding) + cache`: enable semantic caching and use the cache service to store LLM responses for acceleration
If you do not configure a related component, you can ignore the required fields of that component.
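For example, a minimal string-matching-only setup needs just the `cache` block (the service name below is a placeholder):

```yaml
# String-matching cache only: no vector or embedding configured,
# so enableSemanticCache stays at its default (false).
cache:
  type: redis
  serviceName: my_redis.dns   # placeholder; must include the service type suffix
  servicePort: 6379
  timeout: 1000
```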
### Vector Database Service (`vector`)
| Name | Type | Requirement | Default | Description |
|---|---|---|---|---|
| vector.type | string | required | - | Vector storage service provider type, e.g., dashvector, chroma, elasticsearch, weaviate, pinecone, qdrant, milvus |
| vector.serviceName | string | required | - | Vector storage service name |
| vector.serviceHost | string | optional | - | Vector storage service domain. Required for some providers (e.g., dashvector, pinecone) |
| vector.servicePort | int64 | optional | 443 | Vector storage service port |
| vector.apiKey | string | optional | - | Vector storage service API Key |
| vector.topK | int | optional | 1 | Return TopK results |
| vector.timeout | uint32 | optional | 10000 | Timeout for requesting vector storage service, in milliseconds. Default is 10000 (10 seconds) |
| vector.collectionID | string | optional | - | Vector storage service Collection ID |
| vector.threshold | float64 | optional | 1000 | Vector similarity measurement threshold |
| vector.thresholdRelation | string | optional | "lt" | Similarity measurement comparison method. Similarity measurement methods include Cosine, DotProduct, Euclidean, etc. The first two have higher similarity with larger values, while the latter has higher similarity with smaller values. Use gt for Cosine and DotProduct, and lt for Euclidean. All options include lt (less than), lte (less than or equal to), gt (greater than), gte (greater than or equal to) |
| vector.esUsername | string | optional | - | ElasticSearch username, only for elasticsearch type |
| vector.esPassword | string | optional | - | ElasticSearch password, only for elasticsearch type |
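To make the threshold semantics concrete, here is a minimal Python sketch of the comparison implied by `vector.threshold` and `vector.thresholdRelation`. The names mirror the config fields, but the logic is illustrative only, not the plugin's actual Go implementation:

```python
# Sketch of the cache-hit decision implied by vector.threshold and
# vector.thresholdRelation (illustrative; not the plugin's real code).
RELATIONS = {
    "lt":  lambda score, t: score < t,
    "lte": lambda score, t: score <= t,
    "gt":  lambda score, t: score > t,
    "gte": lambda score, t: score >= t,
}

def is_cache_hit(score: float, threshold: float, relation: str = "lt") -> bool:
    """Return True when the similarity score passes the configured threshold."""
    return RELATIONS[relation](score, threshold)

# Euclidean distance: smaller means more similar, so "lt" is appropriate.
print(is_cache_hit(0.3, 1000, "lt"))    # True
# Cosine / DotProduct: larger means more similar, so use "gt".
print(is_cache_hit(0.92, 0.8, "gt"))    # True
```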
### Text Embedding Service (`embedding`)
| Name | Type | Requirement | Default | Description |
|---|---|---|---|---|
| embedding.type | string | required | - | Text embedding service type, e.g., dashscope, openai, azure, cohere, ollama, huggingface, textin, xfyun |
| embedding.serviceName | string | required | - | Text embedding service name |
| embedding.serviceHost | string | optional | - | Text embedding service domain |
| embedding.servicePort | int64 | optional | 443 | Text embedding service port. Default varies by provider; ollama defaults to 11434 |
| embedding.timeout | uint32 | optional | 10000 | Timeout for requesting text embedding service, in milliseconds. Default is 10000 (10 seconds) |
| embedding.model | string | optional | - | Model name for text embedding service |
| embedding.apiKey | string | optional | - | API Key for text embedding service |
### Cache Service (`cache`)
| Name | Type | Requirement | Default | Description |
|---|---|---|---|---|
| cache.type | string | required | - | Cache service type, e.g., redis |
| cache.serviceName | string | required | - | Cache service name |
| cache.serviceHost | string | optional | - | Cache service domain |
| cache.servicePort | int64 | optional | 6379 | Cache service port. If serviceName ends with .static, default is 80 |
| cache.username | string | optional | - | Cache service username |
| cache.password | string | optional | - | Cache service password |
| cache.timeout | uint32 | optional | 10000 | Cache service timeout, in milliseconds. Default is 10000 (10 seconds) |
| cache.cacheTTL | int | optional | 0 | Cache expiration time, in seconds. Default is 0 (never expire) |
| cache.cacheKeyPrefix | string | optional | "higress-ai-cache:" | Prefix for cache keys |
| cache.database | int | optional | 0 | Database ID to use, only for Redis. For example, configure as 1 for SELECT 1 |
### Other Configurations
| Name | Type | Requirement | Default | Description |
|---|---|---|---|---|
| cacheKeyFrom | string | optional | "messages.@reverse.0.content" | Extract string from request Body using GJSON PATH syntax |
| cacheValueFrom | string | optional | "choices.0.message.content" | Extract string from response Body using GJSON PATH syntax |
| cacheStreamValueFrom | string | optional | "choices.0.delta.content" | Extract string from streaming response Body using GJSON PATH syntax |
| cacheToolCallsFrom | string | optional | "choices.0.delta.content.tool_calls" | Extract string from streaming response Body using GJSON PATH syntax |
| responseTemplate | string | optional | `{"id":"from-cache","choices":[{"index":0,"message":{"role":"assistant","content":"%s"},"finish_reason":"stop"}],"model":"from-cache","object":"chat.completion","usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}` | Template for the HTTP response to return, with `%s` marking the part to be replaced by the cache value |
| streamResponseTemplate | string | optional | `data:{"id":"from-cache","choices":[{"index":0,"delta":{"role":"assistant","content":"%s"},"finish_reason":"stop"}],"model":"from-cache","object":"chat.completion","usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}\n\ndata:[DONE]\n\n` | Template for the streaming HTTP response to return, with `%s` marking the part to be replaced by the cache value |
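The `%s` substitution can be illustrated with a short Python sketch using the default non-streaming template (the cached answer below is made up):

```python
import json

# The %s placeholder in responseTemplate is replaced with the cached value
# before the response is returned; this reproduces that substitution with
# the plugin's default (non-streaming) template.
response_template = (
    '{"id":"from-cache","choices":[{"index":0,"message":{"role":"assistant",'
    '"content":"%s"},"finish_reason":"stop"}],"model":"from-cache",'
    '"object":"chat.completion","usage":{"prompt_tokens":0,'
    '"completion_tokens":0,"total_tokens":0}}'
)

cached_answer = "Higress is a cloud-native API gateway."
body = response_template % cached_answer

parsed = json.loads(body)
print(parsed["choices"][0]["message"]["content"])
```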
## Text Embedding Provider Specific Configurations
### Azure OpenAI
For Azure OpenAI, set embedding.type to azure. You need to first create an Azure OpenAI account, then select and deploy a model in Azure AI Foundry. Click on your deployed model to see the target URI and key in the endpoint section. Please enter the host from the URI in embedding.serviceHost and the key in embedding.apiKey.
A complete URI example is https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/embeddings?api-version=2024-10-21. You need to enter YOUR_RESOURCE_NAME.openai.azure.com in embedding.serviceHost.
Specific configuration fields:
| Name | Data Type | Requirement | Default | Description |
|---|---|---|---|---|
| embedding.apiVersion | string | required | - | API version, the api-version value from the obtained URI |
Note that you must specify embedding.serviceHost, such as YOUR_RESOURCE_NAME.openai.azure.com. The default model is text-embedding-ada-002. For other models, specify in embedding.model.
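Putting the above together, a sketch of an Azure `embedding` block (the service name and key are placeholders):

```yaml
embedding:
  type: azure
  serviceName: my_azure.dns            # placeholder service name
  serviceHost: YOUR_RESOURCE_NAME.openai.azure.com
  apiKey: [Your key]
  apiVersion: "2024-10-21"             # api-version value from the target URI
  model: text-embedding-ada-002        # default; set a different model if needed
```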
### Cohere
For Cohere, set embedding.type to cohere. There are no specific configuration fields. You need to create an API Key and enter it in embedding.apiKey.
### OpenAI
For OpenAI, set embedding.type to openai. There are no specific configuration fields. You need to create an API Key and enter it in embedding.apiKey. An API Key example is sk-xxxxxxx.
### Ollama
For Ollama, set embedding.type to ollama. There are no specific configuration fields.
### Hugging Face
For Hugging Face, set embedding.type to huggingface. There are no specific configuration fields. You need to create an hf_token and enter it in embedding.apiKey. An hf_token example is hf_xxxxxxx.
embedding.model defaults to sentence-transformers/all-MiniLM-L6-v2.
### DashScope
For DashScope, set embedding.type to dashscope. You need to create an API Key and enter it in embedding.apiKey.
embedding.model defaults to text-embedding-v2. Other models like text-embedding-v1 can also be used.
### TextIn
For TextIn, set embedding.type to textin. You need to first obtain app-id and secret-code.
Specific configuration fields:
| Name | Data Type | Requirement | Default | Description |
|---|---|---|---|---|
| embedding.textinAppId | string | required | - | Application ID, the obtained app-id |
| embedding.textinSecretCode | string | required | - | Secret for calling the API, the obtained secret-code |
| embedding.textinMatryoshkaDim | int | required | - | Dimension of the returned vector |
### Xfyun (iFlytek Spark)

For Xfyun, set embedding.type to xfyun. You need to first create an application to obtain the APPID, APISecret, and APIKey, and enter the APIKey in embedding.apiKey.
Specific configuration fields:
| Name | Data Type | Requirement | Default | Description |
|---|---|---|---|---|
| embedding.appId | string | required | - | Application ID, the obtained APPID |
| embedding.apiSecret | string | required | - | Secret for calling the API, the obtained APISecret |
## Vector Database Provider Specific Configurations
### Chroma
For Chroma, set vector.type to chroma. There are no specific configuration fields. You need to create a Collection in advance and fill in the Collection ID in vector.collectionID. A Collection ID example is 52bbb8b3-724c-477b-a4ce-d5b578214612.
### DashVector
For DashVector, set vector.type to dashvector. There are no specific configuration fields. You need to create a Collection in advance and fill in the Collection Name in vector.collectionID.
### ElasticSearch
For ElasticSearch, set vector.type to elasticsearch. You need to create an Index in advance and fill in the Index Name in vector.collectionID.
It currently relies on the KNN method. Please ensure your ES version supports KNN. It has been tested on version 8.16.
Specific configuration fields:
| Name | Data Type | Requirement | Default | Description |
|---|---|---|---|---|
| vector.esUsername | string | optional | - | ElasticSearch username |
| vector.esPassword | string | optional | - | ElasticSearch password |
`vector.esUsername` and `vector.esPassword` are used for Basic authentication. API Key authentication is also supported: when `vector.apiKey` is set, API Key authentication is enabled. For the SaaS version, you need to fill in the encoded API Key value.
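As an illustration, a hypothetical `vector` block for ElasticSearch with Basic authentication (all values are placeholders):

```yaml
vector:
  type: elasticsearch
  serviceName: my_es.dns          # placeholder service name
  serviceHost: [Your ES domain]
  collectionID: [Your Index Name]
  esUsername: elastic             # Basic auth; alternatively set apiKey instead
  esPassword: [Your password]
```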
### Milvus
For Milvus, set vector.type to milvus. There are no specific configuration fields. You need to create a Collection in advance and fill in the Collection Name in vector.collectionID.
### Pinecone
For Pinecone, set vector.type to pinecone. There are no specific configuration fields. You need to create an Index in advance and fill in the Index access domain in vector.serviceHost.
The Namespace parameter in Pinecone is configured through the plugin's vector.collectionID. If vector.collectionID is not filled in, it defaults to the Default Namespace.
### Qdrant
For Qdrant, set vector.type to qdrant. There are no specific configuration fields. You need to create a Collection in advance and fill in the Collection Name in vector.collectionID.
### Weaviate
For Weaviate, set vector.type to weaviate. There are no specific configuration fields. You need to create a Collection in advance and fill in the Collection Name in vector.collectionID.
Note that Weaviate automatically capitalizes the first letter of collection names, so the first letter of the value you fill in for vector.collectionID should be capitalized.
If using SaaS, you need to fill in the vector.serviceHost parameter.
## Configuration Example
### Basic Configuration
```yaml
embedding:
  type: dashscope
  serviceName: my_dashscope.dns
  apiKey: [Your Key]
vector:
  type: dashvector
  serviceName: my_dashvector.dns
  collectionID: [Your Collection ID]
  serviceHost: [Your domain]
  apiKey: [Your key]
cache:
  type: redis
  serviceName: my_redis.dns
  servicePort: 6379
  timeout: 100
```
### Advanced Usage
The default cache key is extracted with the GJSON PATH expression `messages.@reverse.0.content`, which reverses the `messages` array and takes the `content` of the first item (i.e., the last message).

GJSON PATH supports conditional syntax. For example, to use the content of the last message whose role is `user` as the key, write `messages.@reverse.#(role=="user").content`.

To concatenate the content of all messages whose role is `user` into an array as the key, write `messages.@reverse.#(role=="user")#.content`.

Pipe syntax is also supported. For example, to use the second-to-last message whose role is `user` as the key, write `messages.@reverse.#(role=="user")#.content|1`.

For more usage, refer to the official GJSON documentation. You can use the GJSON Playground to test the syntax.
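The expressions above can be checked with a small Python sketch that mimics what they resolve to (the plugin itself evaluates GJSON in Go; the request body here is made up for illustration):

```python
import json

# A made-up OpenAI-style request body.
body = json.dumps({
    "model": "gpt-4o",
    "messages": [
        {"role": "user", "content": "What is Higress?"},
        {"role": "assistant", "content": "An API gateway."},
        {"role": "user", "content": "Does it cache LLM responses?"},
    ],
})
messages = json.loads(body)["messages"]

# messages.@reverse.0.content -> content of the last message
default_key = list(reversed(messages))[0]["content"]

# messages.@reverse.#(role=="user").content -> content of the last user message
last_user = next(m["content"] for m in reversed(messages) if m["role"] == "user")

# messages.@reverse.#(role=="user")#.content -> all user contents, newest first
all_user = [m["content"] for m in reversed(messages) if m["role"] == "user"]

print(default_key)  # Does it cache LLM responses?
print(all_user)     # ['Does it cache LLM responses?', 'What is Higress?']
```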
## FAQ
- If the returned error is `error status returned by host: bad argument`, check whether `serviceName` correctly includes the service type suffix (`.dns`, etc.).