update plugins doc (#1305)

This commit is contained in:
澄潭
2024-09-12 21:48:40 +08:00
committed by GitHub
parent 0f9113ed82
commit c7c4ae1da2
80 changed files with 7373 additions and 2368 deletions

View File

@@ -1,22 +1,26 @@
<p>
English | <a href="README.md">中文</a>
</p>
---
title: Bot Detect
keywords: [higress, bot detect]
description: Bot detect plugin configuration reference
---
## Function Description
The `bot-detect` plugin can be used to identify and block internet crawlers from accessing site resources.
# Description
`bot-detect` plugin can be used to identify and prevent web crawlers from crawling websites.
## Running Properties
Plugin Execution Phase: `Authorization Phase`
Plugin Execution Priority: `310`
# Configuration Fields
## Configuration Fields
| Name | Data Type | Required | Default Value | Description |
| ----------------- | ------------------- | --------------| --------------| ---------------------------------------------------------- |
| allow | array of string | Optional | - | Regular expressions to match the User-Agent request header; requests matching will be allowed to access. |
| deny | array of string | Optional | - | Regular expressions to match the User-Agent request header; requests matching will be blocked. |
| blocked_code | number | Optional | 403 | HTTP status code returned when a request is blocked. |
| blocked_message | string | Optional | - | HTTP response body returned when a request is blocked. |
| Name | Type | Requirement | Default Value | Description |
| -------- | -------- | -------- | -------- | -------- |
| allow | array of string | Optional | - | A regular expression to match the User-Agent request header and will allow access if the match hits |
| deny | array of string | Optional | - | A regular expression to match the User-Agent request header and will block the request if the match hits |
| blocked_code | number | Optional | 403 | The HTTP status code returned when a request is blocked |
| blocked_message | string | Optional | - | The HTTP response Body returned when a request is blocked |
The `allow` and `deny` fields can both be left unconfigured, in which case the default crawler identification logic will be executed. Configuring the `allow` field can allow requests that would otherwise hit the default crawler identification logic. Configuring the `deny` field can add additional crawler identification logic.
If field `allow` and field `deny` are not configured at the same time, the default logic to identify crawlers will be executed. By configuring the `allow` field, requests that would otherwise hit the default logic can be allowed. The judgement can be extended by configuring the `deny` field
The default set of crawler judgment regular expressions is as follows
The default crawler identification regular expression set is as follows:
```bash
# Bots General matcher 'name/0.0'
@@ -33,45 +37,23 @@ The default set of crawler judgment regular expressions is as follows
(CSimpleSpider|Cityreview Robot|CrawlDaddy|CrawlFire|Finderbots|Index crawler|Job Roboter|KiwiStatus Spider|Lijit Crawler|QuerySeekerSpider|ScollSpider|Trends Crawler|USyd-NLP-Spider|SiteCat Webbot|BotName\/\$BotVersion|123metaspider-Bot|1470\.net crawler|50\.nu|8bo Crawler Bot|Aboundex|Accoona-[A-z]{1,30}-Agent|AdsBot-Google(?:-[a-z]{1,30}|)|altavista|AppEngine-Google|archive.{0,30}\.org_bot|archiver|Ask Jeeves|[Bb]ai[Dd]u[Ss]pider(?:-[A-Za-z]{1,30})(?:-[A-Za-z]{1,30}|)|bingbot|BingPreview|blitzbot|BlogBridge|Bloglovin|BoardReader Blog Indexer|BoardReader Favicon Fetcher|boitho.com-dc|BotSeer|BUbiNG|\b\w{0,30}favicon\w{0,30}\b|\bYeti(?:-[a-z]{1,30}|)|Catchpoint(?: bot|)|[Cc]harlotte|Checklinks|clumboot|Comodo HTTP\(S\) Crawler|Comodo-Webinspector-Crawler|ConveraCrawler|CRAWL-E|CrawlConvera|Daumoa(?:-feedfetcher|)|Feed Seeker Bot|Feedbin|findlinks|Flamingo_SearchEngine|FollowSite Bot|furlbot|Genieo|gigabot|GomezAgent|gonzo1|(?:[a-zA-Z]{1,30}-|)Googlebot(?:-[a-zA-Z]{1,30}|)|Google SketchUp|grub-client|gsa-crawler|heritrix|HiddenMarket|holmes|HooWWWer|htdig|ia_archiver|ICC-Crawler|Icarus6j|ichiro(?:/mobile|)|IconSurf|IlTrovatore(?:-Setaccio|)|InfuzApp|Innovazion Crawler|InternetArchive|IP2[a-z]{1,30}Bot|jbot\b|KaloogaBot|Kraken|Kurzor|larbin|LEIA|LesnikBot|Linguee Bot|LinkAider|LinkedInBot|Lite Bot|Llaut|lycos|Mail\.RU_Bot|masscan|masidani_bot|Mediapartners-Google|Microsoft .{0,30} Bot|mogimogi|mozDex|MJ12bot|msnbot(?:-media {0,2}|)|msrbot|Mtps Feed Aggregation System|netresearch|Netvibes|NewsGator[^/]{0,30}|^NING|Nutch[^/]{0,30}|Nymesis|ObjectsSearch|OgScrper|Orbiter|OOZBOT|PagePeeker|PagesInventory|PaxleFramework|Peeplo Screenshot Bot|PlantyNet_WebRobot|Pompos|Qwantify|Read%20Later|Reaper|RedCarpet|Retreiver|Riddler|Rival IQ|scooter|Scrapy|Scrubby|searchsight|seekbot|semanticdiscovery|SemrushBot|Simpy|SimplePie|SEOstats|SimpleRSS|SiteCon|Slackbot-LinkExpanding|Slack-ImgProxy|Slurp|snappy|Speedy Spider|Squrl Java|Stringer|TheUsefulbot|ThumbShotsBot|Thumbshots\.ru|Tiny Tiny RSS|Twitterbot|WhatsApp|URL2PNG|Vagabondo|VoilaBot|^vortex|Votay bot|^voyager|WASALive.Bot|Web-sniffer|WebThumb|WeSEE:[A-z]{1,30}|WhatWeb|WIRE|WordPress|Wotbox|www\.almaden\.ibm\.com|Xenu(?:.s|) Link Sleuth|Xerka [A-z]{1,30}Bot|yacy(?:bot|)|YahooSeeker|Yahoo! Slurp|Yandex\w{1,30}|YodaoBot(?:-[A-z]{1,30}|)|YottaaMonitor|Yowedo|^Zao|^Zao-Crawler|ZeBot_www\.ze\.bz|ZooShot|ZyBorg)(?:[ /]v?(\d+)(?:\.(\d+)(?:\.(\d+)|)|)|)
```
# Configuration Samples
## Release Requests that would otherwise Hit the Crawler Rules
## Configuration Example
### Allowing Requests That Hit the Crawler Rules
```yaml
allow:
- ".*Go-http-client.*"
```
Without this configuration, the default Golang web library request will be treated as a crawler and access will be denied.
If this configuration is not made, requests from the default Golang network library will be treated as crawlers and blocked.
## Add Crawler Judgement
### Adding Crawler Identification
```yaml
deny:
- "spd-tools.*"
```
According to this configuration, the following requests will be denied:
With this configuration, the following requests will be blocked:
```bash
curl http://example.com -H 'User-Agent: spd-tools/1.1'
curl http://exmaple.com -H 'User-Agent: spd-tools'
```
## Only Enabled for Specific Routes or Domains
```yaml
# Use _rules_ field for fine-grained rule configurations
_rules_:
# Rule 1: Match by route name
- _match_route_:
- route-a
- route-b
# Rule 2: Match by domain
- _match_domain_:
- "*.example.com"
- test.com
allow:
- ".*Go-http-client.*"
```
In the rule sample of `_match_route_`, `route-a` and `route-b` are the route names provided when creating a new gateway route. When the current route names matches the configuration, the rule following shall be applied.
In the rule sample of `_match_domain_`, `*.example.com` and `test.com` are the domain names used for request matching. When the current domain name matches the configuration, the rule following shall be applied.
All rules shall be checked following the order of items in the `_rules_` field, The first matched rule will be applied. All remained will be ignored.