update plugins doc (#1305)

澄潭
2024-09-12 21:48:40 +08:00
committed by GitHub
parent 0f9113ed82
commit c7c4ae1da2
80 changed files with 7373 additions and 2368 deletions


@@ -1,11 +1,21 @@
---
title: Bot Detect
keywords: [higress,bot detect]
description: Bot detect plugin configuration reference
---
## Function Description
The `bot-detect` plugin can be used to identify and block internet crawlers from accessing site resources.
## Running Properties
Plugin Execution Phase: `Authorization Phase`
Plugin Execution Priority: `310`
## Configuration Fields
| Name | Data Type | Required | Default Value | Description |
| -------- | -------- | -------- | -------- | -------- |
@@ -33,9 +43,9 @@
(CSimpleSpider|Cityreview Robot|CrawlDaddy|CrawlFire|Finderbots|Index crawler|Job Roboter|KiwiStatus Spider|Lijit Crawler|QuerySeekerSpider|ScollSpider|Trends Crawler|USyd-NLP-Spider|SiteCat Webbot|BotName\/\$BotVersion|123metaspider-Bot|1470\.net crawler|50\.nu|8bo Crawler Bot|Aboundex|Accoona-[A-z]{1,30}-Agent|AdsBot-Google(?:-[a-z]{1,30}|)|altavista|AppEngine-Google|archive.{0,30}\.org_bot|archiver|Ask Jeeves|[Bb]ai[Dd]u[Ss]pider(?:-[A-Za-z]{1,30})(?:-[A-Za-z]{1,30}|)|bingbot|BingPreview|blitzbot|BlogBridge|Bloglovin|BoardReader Blog Indexer|BoardReader Favicon Fetcher|boitho.com-dc|BotSeer|BUbiNG|\b\w{0,30}favicon\w{0,30}\b|\bYeti(?:-[a-z]{1,30}|)|Catchpoint(?: bot|)|[Cc]harlotte|Checklinks|clumboot|Comodo HTTP\(S\) Crawler|Comodo-Webinspector-Crawler|ConveraCrawler|CRAWL-E|CrawlConvera|Daumoa(?:-feedfetcher|)|Feed Seeker Bot|Feedbin|findlinks|Flamingo_SearchEngine|FollowSite Bot|furlbot|Genieo|gigabot|GomezAgent|gonzo1|(?:[a-zA-Z]{1,30}-|)Googlebot(?:-[a-zA-Z]{1,30}|)|Google SketchUp|grub-client|gsa-crawler|heritrix|HiddenMarket|holmes|HooWWWer|htdig|ia_archiver|ICC-Crawler|Icarus6j|ichiro(?:/mobile|)|IconSurf|IlTrovatore(?:-Setaccio|)|InfuzApp|Innovazion Crawler|InternetArchive|IP2[a-z]{1,30}Bot|jbot\b|KaloogaBot|Kraken|Kurzor|larbin|LEIA|LesnikBot|Linguee Bot|LinkAider|LinkedInBot|Lite Bot|Llaut|lycos|Mail\.RU_Bot|masscan|masidani_bot|Mediapartners-Google|Microsoft .{0,30} Bot|mogimogi|mozDex|MJ12bot|msnbot(?:-media {0,2}|)|msrbot|Mtps Feed Aggregation System|netresearch|Netvibes|NewsGator[^/]{0,30}|^NING|Nutch[^/]{0,30}|Nymesis|ObjectsSearch|OgScrper|Orbiter|OOZBOT|PagePeeker|PagesInventory|PaxleFramework|Peeplo Screenshot Bot|PlantyNet_WebRobot|Pompos|Qwantify|Read%20Later|Reaper|RedCarpet|Retreiver|Riddler|Rival IQ|scooter|Scrapy|Scrubby|searchsight|seekbot|semanticdiscovery|SemrushBot|Simpy|SimplePie|SEOstats|SimpleRSS|SiteCon|Slackbot-LinkExpanding|Slack-ImgProxy|Slurp|snappy|Speedy Spider|Squrl Java|Stringer|TheUsefulbot|ThumbShotsBot|Thumbshots\.ru|Tiny Tiny RSS|Twitterbot|WhatsApp|URL2PNG|Vagabondo|VoilaBot|^vortex|Votay bot|^voyager|WASALive.Bot|Web-sniffer|WebThumb|WeSEE:[A-z]{1,30}|WhatWeb|WIRE|WordPress|Wotbox|www\.almaden\.ibm\.com|Xenu(?:.s|) Link Sleuth|Xerka [A-z]{1,30}Bot|yacy(?:bot|)|YahooSeeker|Yahoo! Slurp|Yandex\w{1,30}|YodaoBot(?:-[A-z]{1,30}|)|YottaaMonitor|Yowedo|^Zao|^Zao-Crawler|ZeBot_www\.ze\.bz|ZooShot|ZyBorg)(?:[ /]v?(\d+)(?:\.(\d+)(?:\.(\d+)|)|)|)
```
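A request whose User-Agent matches this default set, such as the sketch below, is blocked out of the box:
```bash
# 'Googlebot' is matched by the default crawler regular expressions
curl http://example.com -H 'User-Agent: Googlebot/2.1'
```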
## Configuration Example
### Allowing Requests That Would Otherwise Hit the Crawler Rules
```yaml
allow:
- ".*Go-http-client.*"
@@ -44,7 +54,7 @@ allow:
Without this configuration, requests from the default Golang network library will be treated as crawlers and denied access.
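For reference, Go's net/http client sends a default User-Agent of the form `Go-http-client/1.1`, so a request like the following would be blocked without the `allow` entry above:
```bash
# Blocked by the default rules unless ".*Go-http-client.*" is allowed
curl http://example.com -H 'User-Agent: Go-http-client/1.1'
```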
### Adding Crawler Identification
```yaml
deny:
- "spd-tools.*"
@@ -56,24 +66,3 @@ deny:
curl http://example.com -H 'User-Agent: spd-tools/1.1'
curl http://example.com -H 'User-Agent: spd-tools'
```
## Enabling for Specific Routes or Domains
```yaml
# Use the _rules_ field for fine-grained rule configuration
_rules_:
# Rule 1: applies when matched by route name
- _match_route_:
  - route-a
  - route-b
# Rule 2: applies when matched by domain
- _match_domain_:
  - "*.example.com"
  - test.com
  allow:
  - ".*Go-http-client.*"
```
The `route-a` and `route-b` specified in `_match_route_` are the route names filled in when creating gateway routes; when one of these routes is matched, this rule's configuration applies. The `*.example.com` and `test.com` specified in `_match_domain_` are used to match the request's domain; when the domain matches, this rule's configuration applies. Rules are matched in the order they appear under `_rules_`: the first rule that matches takes effect, and subsequent rules are ignored.
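As an illustrative sketch (the domain `api.example.com` is hypothetical), the ordering means a request to `api.example.com` is handled entirely by the first rule below, even though the second rule's pattern also matches it:
```yaml
_rules_:
# Matched first for api.example.com; only this allow list applies
- _match_domain_:
  - api.example.com
  allow:
  - ".*Go-http-client.*"
# Never evaluated for api.example.com, although "*.example.com" also matches
- _match_domain_:
  - "*.example.com"
  deny:
  - "spd-tools.*"
```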


@@ -1,22 +1,26 @@
---
title: Bot Detect
keywords: [higress, bot detect]
description: Bot detect plugin configuration reference
---
## Function Description
The `bot-detect` plugin can be used to identify and block internet crawlers from accessing site resources.
## Running Properties
Plugin Execution Phase: `Authorization Phase`
Plugin Execution Priority: `310`
## Configuration Fields
| Name | Data Type | Required | Default Value | Description |
| ----------------- | ------------------- | --------------| --------------| ---------------------------------------------------------- |
| allow | array of string | Optional | - | Regular expressions matched against the User-Agent request header; matching requests are allowed. |
| deny | array of string | Optional | - | Regular expressions matched against the User-Agent request header; matching requests are blocked. |
| blocked_code | number | Optional | 403 | HTTP status code returned when a request is blocked. |
| blocked_message | string | Optional | - | HTTP response body returned when a request is blocked. |
The `allow` and `deny` fields can both be left unconfigured, in which case the default crawler identification logic will be executed. Configuring the `allow` field can allow requests that would otherwise hit the default crawler identification logic. Configuring the `deny` field can add additional crawler identification logic.
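All four fields can also be combined. The following is a minimal sketch; the `blocked_code` and `blocked_message` values are illustrative, not defaults:
```yaml
# Extend detection, allow Go clients, and customize the block response
deny:
- "spd-tools.*"
allow:
- ".*Go-http-client.*"
blocked_code: 429
blocked_message: "bot access is not allowed"
```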
The default crawler identification regular expression set is as follows:
```bash
# Bots General matcher 'name/0.0'
@@ -33,45 +37,23 @@ The default set of crawler judgment regular expressions is as follows
(CSimpleSpider|Cityreview Robot|CrawlDaddy|CrawlFire|Finderbots|Index crawler|Job Roboter|KiwiStatus Spider|Lijit Crawler|QuerySeekerSpider|ScollSpider|Trends Crawler|USyd-NLP-Spider|SiteCat Webbot|BotName\/\$BotVersion|123metaspider-Bot|1470\.net crawler|50\.nu|8bo Crawler Bot|Aboundex|Accoona-[A-z]{1,30}-Agent|AdsBot-Google(?:-[a-z]{1,30}|)|altavista|AppEngine-Google|archive.{0,30}\.org_bot|archiver|Ask Jeeves|[Bb]ai[Dd]u[Ss]pider(?:-[A-Za-z]{1,30})(?:-[A-Za-z]{1,30}|)|bingbot|BingPreview|blitzbot|BlogBridge|Bloglovin|BoardReader Blog Indexer|BoardReader Favicon Fetcher|boitho.com-dc|BotSeer|BUbiNG|\b\w{0,30}favicon\w{0,30}\b|\bYeti(?:-[a-z]{1,30}|)|Catchpoint(?: bot|)|[Cc]harlotte|Checklinks|clumboot|Comodo HTTP\(S\) Crawler|Comodo-Webinspector-Crawler|ConveraCrawler|CRAWL-E|CrawlConvera|Daumoa(?:-feedfetcher|)|Feed Seeker Bot|Feedbin|findlinks|Flamingo_SearchEngine|FollowSite Bot|furlbot|Genieo|gigabot|GomezAgent|gonzo1|(?:[a-zA-Z]{1,30}-|)Googlebot(?:-[a-zA-Z]{1,30}|)|Google SketchUp|grub-client|gsa-crawler|heritrix|HiddenMarket|holmes|HooWWWer|htdig|ia_archiver|ICC-Crawler|Icarus6j|ichiro(?:/mobile|)|IconSurf|IlTrovatore(?:-Setaccio|)|InfuzApp|Innovazion Crawler|InternetArchive|IP2[a-z]{1,30}Bot|jbot\b|KaloogaBot|Kraken|Kurzor|larbin|LEIA|LesnikBot|Linguee Bot|LinkAider|LinkedInBot|Lite Bot|Llaut|lycos|Mail\.RU_Bot|masscan|masidani_bot|Mediapartners-Google|Microsoft .{0,30} Bot|mogimogi|mozDex|MJ12bot|msnbot(?:-media {0,2}|)|msrbot|Mtps Feed Aggregation System|netresearch|Netvibes|NewsGator[^/]{0,30}|^NING|Nutch[^/]{0,30}|Nymesis|ObjectsSearch|OgScrper|Orbiter|OOZBOT|PagePeeker|PagesInventory|PaxleFramework|Peeplo Screenshot Bot|PlantyNet_WebRobot|Pompos|Qwantify|Read%20Later|Reaper|RedCarpet|Retreiver|Riddler|Rival IQ|scooter|Scrapy|Scrubby|searchsight|seekbot|semanticdiscovery|SemrushBot|Simpy|SimplePie|SEOstats|SimpleRSS|SiteCon|Slackbot-LinkExpanding|Slack-ImgProxy|Slurp|snappy|Speedy Spider|Squrl Java|Stringer|TheUsefulbot|ThumbShotsBot|Thumbshots\.ru|Tiny Tiny RSS|Twitterbot|WhatsApp|URL2PNG|Vagabondo|VoilaBot|^vortex|Votay bot|^voyager|WASALive.Bot|Web-sniffer|WebThumb|WeSEE:[A-z]{1,30}|WhatWeb|WIRE|WordPress|Wotbox|www\.almaden\.ibm\.com|Xenu(?:.s|) Link Sleuth|Xerka [A-z]{1,30}Bot|yacy(?:bot|)|YahooSeeker|Yahoo! Slurp|Yandex\w{1,30}|YodaoBot(?:-[A-z]{1,30}|)|YottaaMonitor|Yowedo|^Zao|^Zao-Crawler|ZeBot_www\.ze\.bz|ZooShot|ZyBorg)(?:[ /]v?(\d+)(?:\.(\d+)(?:\.(\d+)|)|)|)
```
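For example, a request whose User-Agent hits the default set, such as the one below, is blocked out of the box:
```bash
# 'Googlebot' is matched by the default crawler regular expressions
curl http://example.com -H 'User-Agent: Googlebot/2.1'
```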
## Configuration Example
### Allowing Requests That Would Otherwise Hit the Crawler Rules
```yaml
allow:
- ".*Go-http-client.*"
```
Without this configuration, requests from the default Golang network library will be treated as crawlers and blocked.
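For instance, Go's standard net/http client sends `Go-http-client/1.1` as its default User-Agent, so without the `allow` rule above a request like this one is blocked:
```bash
# Blocked by the default rules unless ".*Go-http-client.*" is allowed
curl http://example.com -H 'User-Agent: Go-http-client/1.1'
```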
### Adding Crawler Identification
```yaml
deny:
- "spd-tools.*"
```
With this configuration, the following requests will be blocked:
```bash
curl http://example.com -H 'User-Agent: spd-tools/1.1'
curl http://example.com -H 'User-Agent: spd-tools'
```
## Enabling for Specific Routes or Domains
```yaml
# Use the _rules_ field for fine-grained rule configuration
_rules_:
# Rule 1: applies when matched by route name
- _match_route_:
  - route-a
  - route-b
# Rule 2: applies when matched by domain
- _match_domain_:
  - "*.example.com"
  - test.com
  allow:
  - ".*Go-http-client.*"
```
In this example, `route-a` and `route-b` in `_match_route_` are the route names provided when creating gateway routes; when one of these routes is matched, this rule's configuration applies. The `*.example.com` and `test.com` in `_match_domain_` are used to match the request's domain; when the domain matches, this rule's configuration applies. Rules are checked in the order they appear under `_rules_`: the first rule that matches takes effect, and all remaining rules are ignored.
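For instance (using a hypothetical domain `api.example.com`), a request to `api.example.com` would be served by the first rule alone; the second rule is skipped even though its wildcard also covers that domain:
```yaml
_rules_:
# Matched first for api.example.com; only this allow list applies
- _match_domain_:
  - api.example.com
  allow:
  - ".*Go-http-client.*"
# Skipped for api.example.com, although "*.example.com" also matches
- _match_domain_:
  - "*.example.com"
  deny:
  - "spd-tools.*"
```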