update plugins doc (#1305)

澄潭
2024-09-12 21:48:40 +08:00
committed by GitHub
parent 0f9113ed82
commit c7c4ae1da2
80 changed files with 7373 additions and 2368 deletions


@@ -1,11 +1,21 @@
---
title: Bot Detect
keywords: [higress,bot detect]
description: Bot detect plugin configuration reference
---
## Function Description
The `bot-detect` plugin can be used to identify and block internet crawlers from accessing site resources.
## Running Properties
Plugin Execution Phase: `Authorization Phase`
Plugin Execution Priority: `310`
## Configuration Fields
| Name | Data Type | Required | Default Value | Description |
| -------- | -------- | -------- | -------- | -------- |
@@ -33,9 +43,9 @@
(CSimpleSpider|Cityreview Robot|CrawlDaddy|CrawlFire|Finderbots|Index crawler|Job Roboter|KiwiStatus Spider|Lijit Crawler|QuerySeekerSpider|ScollSpider|Trends Crawler|USyd-NLP-Spider|SiteCat Webbot|BotName\/\$BotVersion|123metaspider-Bot|1470\.net crawler|50\.nu|8bo Crawler Bot|Aboundex|Accoona-[A-z]{1,30}-Agent|AdsBot-Google(?:-[a-z]{1,30}|)|altavista|AppEngine-Google|archive.{0,30}\.org_bot|archiver|Ask Jeeves|[Bb]ai[Dd]u[Ss]pider(?:-[A-Za-z]{1,30})(?:-[A-Za-z]{1,30}|)|bingbot|BingPreview|blitzbot|BlogBridge|Bloglovin|BoardReader Blog Indexer|BoardReader Favicon Fetcher|boitho.com-dc|BotSeer|BUbiNG|\b\w{0,30}favicon\w{0,30}\b|\bYeti(?:-[a-z]{1,30}|)|Catchpoint(?: bot|)|[Cc]harlotte|Checklinks|clumboot|Comodo HTTP\(S\) Crawler|Comodo-Webinspector-Crawler|ConveraCrawler|CRAWL-E|CrawlConvera|Daumoa(?:-feedfetcher|)|Feed Seeker Bot|Feedbin|findlinks|Flamingo_SearchEngine|FollowSite Bot|furlbot|Genieo|gigabot|GomezAgent|gonzo1|(?:[a-zA-Z]{1,30}-|)Googlebot(?:-[a-zA-Z]{1,30}|)|Google SketchUp|grub-client|gsa-crawler|heritrix|HiddenMarket|holmes|HooWWWer|htdig|ia_archiver|ICC-Crawler|Icarus6j|ichiro(?:/mobile|)|IconSurf|IlTrovatore(?:-Setaccio|)|InfuzApp|Innovazion Crawler|InternetArchive|IP2[a-z]{1,30}Bot|jbot\b|KaloogaBot|Kraken|Kurzor|larbin|LEIA|LesnikBot|Linguee Bot|LinkAider|LinkedInBot|Lite Bot|Llaut|lycos|Mail\.RU_Bot|masscan|masidani_bot|Mediapartners-Google|Microsoft .{0,30} Bot|mogimogi|mozDex|MJ12bot|msnbot(?:-media {0,2}|)|msrbot|Mtps Feed Aggregation System|netresearch|Netvibes|NewsGator[^/]{0,30}|^NING|Nutch[^/]{0,30}|Nymesis|ObjectsSearch|OgScrper|Orbiter|OOZBOT|PagePeeker|PagesInventory|PaxleFramework|Peeplo Screenshot Bot|PlantyNet_WebRobot|Pompos|Qwantify|Read%20Later|Reaper|RedCarpet|Retreiver|Riddler|Rival IQ|scooter|Scrapy|Scrubby|searchsight|seekbot|semanticdiscovery|SemrushBot|Simpy|SimplePie|SEOstats|SimpleRSS|SiteCon|Slackbot-LinkExpanding|Slack-ImgProxy|Slurp|snappy|Speedy Spider|Squrl Java|Stringer|TheUsefulbot|ThumbShotsBot|Thumbshots\.ru|Tiny Tiny RSS|Twitterbot|WhatsApp|URL2PNG|Vagabondo|VoilaBot|^vortex|Votay bot|^voyager|WASALive.Bot|Web-sniffer|WebThumb|WeSEE:[A-z]{1,30}|WhatWeb|WIRE|WordPress|Wotbox|www\.almaden\.ibm\.com|Xenu(?:.s|) Link Sleuth|Xerka [A-z]{1,30}Bot|yacy(?:bot|)|YahooSeeker|Yahoo! Slurp|Yandex\w{1,30}|YodaoBot(?:-[A-z]{1,30}|)|YottaaMonitor|Yowedo|^Zao|^Zao-Crawler|ZeBot_www\.ze\.bz|ZooShot|ZyBorg)(?:[ /]v?(\d+)(?:\.(\d+)(?:\.(\d+)|)|)|)
```
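A request whose User-Agent matches this default set, such as the sketch below, is blocked out of the box:
```bash
# 'Googlebot' is matched by the default crawler regular expressions
curl http://example.com -H 'User-Agent: Googlebot/2.1'
```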
## Configuration Example
### Allowing Requests That Would Otherwise Hit the Crawler Rules
```yaml
allow:
- ".*Go-http-client.*"
@@ -44,7 +54,7 @@ allow:
Without this configuration, requests from the default Golang network library will be treated as crawlers and denied access.
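For reference, Go's net/http client sends a default User-Agent of the form `Go-http-client/1.1`, so a request like the following would be blocked without the `allow` entry above:
```bash
# Blocked by the default rules unless ".*Go-http-client.*" is allowed
curl http://example.com -H 'User-Agent: Go-http-client/1.1'
```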
### Adding Crawler Identification
```yaml
deny:
- "spd-tools.*"
@@ -56,24 +66,3 @@ deny:
curl http://example.com -H 'User-Agent: spd-tools/1.1'
curl http://example.com -H 'User-Agent: spd-tools'
```
## Enabling for Specific Routes or Domains
```yaml
# Use the _rules_ field for fine-grained rule configuration
_rules_:
# Rule 1: applies when matched by route name
- _match_route_:
  - route-a
  - route-b
# Rule 2: applies when matched by domain
- _match_domain_:
  - "*.example.com"
  - test.com
  allow:
  - ".*Go-http-client.*"
```
The `route-a` and `route-b` specified in `_match_route_` are the route names filled in when creating gateway routes; when one of these routes is matched, this rule's configuration applies. The `*.example.com` and `test.com` specified in `_match_domain_` are used to match the request's domain; when the domain matches, this rule's configuration applies. Rules are matched in the order they appear under `_rules_`: the first rule that matches takes effect, and subsequent rules are ignored.
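As an illustrative sketch (the domain `api.example.com` is hypothetical), the ordering means a request to `api.example.com` is handled entirely by the first rule below, even though the second rule's pattern also matches it:
```yaml
_rules_:
# Matched first for api.example.com; only this allow list applies
- _match_domain_:
  - api.example.com
  allow:
  - ".*Go-http-client.*"
# Never evaluated for api.example.com, although "*.example.com" also matches
- _match_domain_:
  - "*.example.com"
  deny:
  - "spd-tools.*"
```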


@@ -1,22 +1,26 @@
---
title: Bot Detect
keywords: [higress, bot detect]
description: Bot detect plugin configuration reference
---
## Function Description
The `bot-detect` plugin can be used to identify and block internet crawlers from accessing site resources.
## Running Properties
Plugin Execution Phase: `Authorization Phase`
Plugin Execution Priority: `310`
## Configuration Fields
| Name | Data Type | Required | Default Value | Description |
| ----------------- | ------------------- | --------------| --------------| ---------------------------------------------------------- |
| allow | array of string | Optional | - | Regular expressions matched against the User-Agent request header; matching requests are allowed. |
| deny | array of string | Optional | - | Regular expressions matched against the User-Agent request header; matching requests are blocked. |
| blocked_code | number | Optional | 403 | HTTP status code returned when a request is blocked. |
| blocked_message | string | Optional | - | HTTP response body returned when a request is blocked. |
The `allow` and `deny` fields can both be left unconfigured, in which case the default crawler identification logic will be executed. Configuring the `allow` field can allow requests that would otherwise hit the default crawler identification logic. Configuring the `deny` field can add additional crawler identification logic.
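All four fields can also be combined. The following is a minimal sketch; the `blocked_code` and `blocked_message` values are illustrative, not defaults:
```yaml
# Extend detection, allow Go clients, and customize the block response
deny:
- "spd-tools.*"
allow:
- ".*Go-http-client.*"
blocked_code: 429
blocked_message: "bot access is not allowed"
```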
The default crawler identification regular expression set is as follows:
```bash
# Bots General matcher 'name/0.0'
@@ -33,45 +37,23 @@ The default set of crawler judgment regular expressions is as follows
(CSimpleSpider|Cityreview Robot|CrawlDaddy|CrawlFire|Finderbots|Index crawler|Job Roboter|KiwiStatus Spider|Lijit Crawler|QuerySeekerSpider|ScollSpider|Trends Crawler|USyd-NLP-Spider|SiteCat Webbot|BotName\/\$BotVersion|123metaspider-Bot|1470\.net crawler|50\.nu|8bo Crawler Bot|Aboundex|Accoona-[A-z]{1,30}-Agent|AdsBot-Google(?:-[a-z]{1,30}|)|altavista|AppEngine-Google|archive.{0,30}\.org_bot|archiver|Ask Jeeves|[Bb]ai[Dd]u[Ss]pider(?:-[A-Za-z]{1,30})(?:-[A-Za-z]{1,30}|)|bingbot|BingPreview|blitzbot|BlogBridge|Bloglovin|BoardReader Blog Indexer|BoardReader Favicon Fetcher|boitho.com-dc|BotSeer|BUbiNG|\b\w{0,30}favicon\w{0,30}\b|\bYeti(?:-[a-z]{1,30}|)|Catchpoint(?: bot|)|[Cc]harlotte|Checklinks|clumboot|Comodo HTTP\(S\) Crawler|Comodo-Webinspector-Crawler|ConveraCrawler|CRAWL-E|CrawlConvera|Daumoa(?:-feedfetcher|)|Feed Seeker Bot|Feedbin|findlinks|Flamingo_SearchEngine|FollowSite Bot|furlbot|Genieo|gigabot|GomezAgent|gonzo1|(?:[a-zA-Z]{1,30}-|)Googlebot(?:-[a-zA-Z]{1,30}|)|Google SketchUp|grub-client|gsa-crawler|heritrix|HiddenMarket|holmes|HooWWWer|htdig|ia_archiver|ICC-Crawler|Icarus6j|ichiro(?:/mobile|)|IconSurf|IlTrovatore(?:-Setaccio|)|InfuzApp|Innovazion Crawler|InternetArchive|IP2[a-z]{1,30}Bot|jbot\b|KaloogaBot|Kraken|Kurzor|larbin|LEIA|LesnikBot|Linguee Bot|LinkAider|LinkedInBot|Lite Bot|Llaut|lycos|Mail\.RU_Bot|masscan|masidani_bot|Mediapartners-Google|Microsoft .{0,30} Bot|mogimogi|mozDex|MJ12bot|msnbot(?:-media {0,2}|)|msrbot|Mtps Feed Aggregation System|netresearch|Netvibes|NewsGator[^/]{0,30}|^NING|Nutch[^/]{0,30}|Nymesis|ObjectsSearch|OgScrper|Orbiter|OOZBOT|PagePeeker|PagesInventory|PaxleFramework|Peeplo Screenshot Bot|PlantyNet_WebRobot|Pompos|Qwantify|Read%20Later|Reaper|RedCarpet|Retreiver|Riddler|Rival IQ|scooter|Scrapy|Scrubby|searchsight|seekbot|semanticdiscovery|SemrushBot|Simpy|SimplePie|SEOstats|SimpleRSS|SiteCon|Slackbot-LinkExpanding|Slack-ImgProxy|Slurp|snappy|Speedy Spider|Squrl Java|Stringer|TheUsefulbot|ThumbShotsBot|Thumbshots\.ru|Tiny Tiny RSS|Twitterbot|WhatsApp|URL2PNG|Vagabondo|VoilaBot|^vortex|Votay bot|^voyager|WASALive.Bot|Web-sniffer|WebThumb|WeSEE:[A-z]{1,30}|WhatWeb|WIRE|WordPress|Wotbox|www\.almaden\.ibm\.com|Xenu(?:.s|) Link Sleuth|Xerka [A-z]{1,30}Bot|yacy(?:bot|)|YahooSeeker|Yahoo! Slurp|Yandex\w{1,30}|YodaoBot(?:-[A-z]{1,30}|)|YottaaMonitor|Yowedo|^Zao|^Zao-Crawler|ZeBot_www\.ze\.bz|ZooShot|ZyBorg)(?:[ /]v?(\d+)(?:\.(\d+)(?:\.(\d+)|)|)|)
```
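For example, a request whose User-Agent hits the default set, such as the one below, is blocked out of the box:
```bash
# 'Googlebot' is matched by the default crawler regular expressions
curl http://example.com -H 'User-Agent: Googlebot/2.1'
```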
## Configuration Example
### Allowing Requests That Would Otherwise Hit the Crawler Rules
```yaml
allow:
- ".*Go-http-client.*"
```
Without this configuration, requests from the default Golang network library will be treated as crawlers and blocked.
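For instance, Go's standard net/http client sends `Go-http-client/1.1` as its default User-Agent, so without the `allow` rule above a request like this one is blocked:
```bash
# Blocked by the default rules unless ".*Go-http-client.*" is allowed
curl http://example.com -H 'User-Agent: Go-http-client/1.1'
```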
### Adding Crawler Identification
```yaml
deny:
- "spd-tools.*"
```
With this configuration, the following requests will be blocked:
```bash
curl http://example.com -H 'User-Agent: spd-tools/1.1'
curl http://example.com -H 'User-Agent: spd-tools'
```
## Enabling for Specific Routes or Domains
```yaml
# Use the _rules_ field for fine-grained rule configuration
_rules_:
# Rule 1: applies when matched by route name
- _match_route_:
  - route-a
  - route-b
# Rule 2: applies when matched by domain
- _match_domain_:
  - "*.example.com"
  - test.com
  allow:
  - ".*Go-http-client.*"
```
In this example, `route-a` and `route-b` in `_match_route_` are the route names provided when creating gateway routes; when one of these routes is matched, this rule's configuration applies. The `*.example.com` and `test.com` in `_match_domain_` are used to match the request's domain; when the domain matches, this rule's configuration applies. Rules are checked in the order they appear under `_rules_`: the first rule that matches takes effect, and all remaining rules are ignored.
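For instance (using a hypothetical domain `api.example.com`), a request to `api.example.com` would be served by the first rule alone; the second rule is skipped even though its wildcard also covers that domain:
```yaml
_rules_:
# Matched first for api.example.com; only this allow list applies
- _match_domain_:
  - api.example.com
  allow:
  - ".*Go-http-client.*"
# Skipped for api.example.com, although "*.example.com" also matches
- _match_domain_:
  - "*.example.com"
  deny:
  - "spd-tools.*"
```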