docs: translate the document of bot-detect plugin to English. (#153)

2026-06-24 09:45:16 +08:00 · 2023-02-01 10:52:53 +08:00
parent 146b0a5135
commit e15f77e029
3 changed files with 82 additions and 1 deletions
--- a/plugins/wasm-cpp/extensions/bot_detect/README.md
+++ b/plugins/wasm-cpp/extensions/bot_detect/README.md
@@ -1,3 +1,7 @@
+<p>
+   <a href="README_EN.md"> English </a> | 中文
+</p>
+
 # 功能说明
 `bot-detect`插件可以用于识别并阻止互联网爬虫对站点资源的爬取

--- a/plugins/wasm-cpp/extensions/bot_detect/README_EN.md
+++ b/plugins/wasm-cpp/extensions/bot_detect/README_EN.md
@@ -0,0 +1,77 @@
+<p>
+   English | <a href="README.md">中文</a>
+</p>
+
+# Description
+The `bot-detect` plugin identifies web crawlers and blocks them from crawling site resources.
+
+# Configuration Fields
+
+| Name | Type | Requirement |  Default Value | Description |
+| -------- | -------- | -------- | -------- | -------- |
+|  allow     |  array of string     | Optional     |   -  |  Regular expressions matched against the User-Agent request header; the request is allowed if any expression matches   |
+|  deny     |  array of string     | Optional     |   -  |  Regular expressions matched against the User-Agent request header; the request is blocked if any expression matches   |
+|  blocked_code     |  number     | Optional     |   403  |  The HTTP status code returned when a request is blocked   |
+|  blocked_message     |  string     | Optional     |   -  |  The HTTP response body returned when a request is blocked   |
+
+If neither the `allow` field nor the `deny` field is configured, the default crawler-identification logic is executed. Configuring the `allow` field lets through requests that would otherwise hit the default logic. Configuring the `deny` field adds further crawler-identification rules.
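+
+As a minimal sketch of how these fields can be combined in one configuration (the patterns, status code, and message below are illustrative values, not defaults taken from the plugin):
+
+```yaml
+# Hypothetical example values
+allow:
+- ".*Go-http-client.*"
+deny:
+- "spd-tools.*"
+blocked_code: 429
+blocked_message: "Automated access is not permitted"
+```
+
+With a configuration like this, a blocked request would be answered with status 429 and the given body instead of the default 403.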
+
+The default set of regular expressions used to identify crawlers is as follows:
+
+```bash
+# Bots General matcher 'name/0.0'
+    (?:\/[A-Za-z0-9\.]+|) {0,5}([A-Za-z0-9 \-_\!\[\]:]{0,50}(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50}))[/ ](\d+)(?:\.(\d+)(?:\.(\d+)|)|)
+# Bots General matcher 'name 0.0'
+    (?:\/[A-Za-z0-9\.]+|) {0,5}([A-Za-z0-9 \-_\!\[\]:]{0,50}(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]{0,50})) (\d+)(?:\.(\d+)(?:\.(\d+)|)|)
+# Bots containing spider|scrape|bot(but not CUBOT)|Crawl
+    ((?:[A-z0-9]{1,50}|[A-z\-]{1,50} ?|)(?: the |)(?:[Ss][Pp][Ii][Dd][Ee][Rr]|[Ss]crape|[Cc][Rr][Aa][Ww][Ll])[A-z0-9]{0,50})(?:(?:[ /]| v)(\d+)(?:\.(\d+)|)(?:\.(\d+)|)|)
+# Bots Pattern '/name-0.0'
+    /((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?
+# Bots Pattern 'name/0.0'
+    \b(008|Altresium|Argus|BaiduMobaider|BoardReader|DNSGroup|DataparkSearch|EDI|Goodzer|Grub|INGRID|Infohelfer|LinkedInBot|LOOQ|Nutch|OgScrper|PathDefender|Peew|PostPost|Steeler|Twitterbot|VSE|WebCrunch|WebZIP|Y!J-BR[A-Z]|YahooSeeker|envolk|sproose|wminer)/(\d+)(?:\.(\d+)|)(?:\.(\d+)|)
+# More bots
+    (CSimpleSpider|Cityreview Robot|CrawlDaddy|CrawlFire|Finderbots|Index crawler|Job Roboter|KiwiStatus Spider|Lijit Crawler|QuerySeekerSpider|ScollSpider|Trends Crawler|USyd-NLP-Spider|SiteCat Webbot|BotName\/\$BotVersion|123metaspider-Bot|1470\.net crawler|50\.nu|8bo Crawler Bot|Aboundex|Accoona-[A-z]{1,30}-Agent|AdsBot-Google(?:-[a-z]{1,30}|)|altavista|AppEngine-Google|archive.{0,30}\.org_bot|archiver|Ask Jeeves|[Bb]ai[Dd]u[Ss]pider(?:-[A-Za-z]{1,30})(?:-[A-Za-z]{1,30}|)|bingbot|BingPreview|blitzbot|BlogBridge|Bloglovin|BoardReader Blog Indexer|BoardReader Favicon Fetcher|boitho.com-dc|BotSeer|BUbiNG|\b\w{0,30}favicon\w{0,30}\b|\bYeti(?:-[a-z]{1,30}|)|Catchpoint(?: bot|)|[Cc]harlotte|Checklinks|clumboot|Comodo HTTP\(S\) Crawler|Comodo-Webinspector-Crawler|ConveraCrawler|CRAWL-E|CrawlConvera|Daumoa(?:-feedfetcher|)|Feed Seeker Bot|Feedbin|findlinks|Flamingo_SearchEngine|FollowSite Bot|furlbot|Genieo|gigabot|GomezAgent|gonzo1|(?:[a-zA-Z]{1,30}-|)Googlebot(?:-[a-zA-Z]{1,30}|)|Google SketchUp|grub-client|gsa-crawler|heritrix|HiddenMarket|holmes|HooWWWer|htdig|ia_archiver|ICC-Crawler|Icarus6j|ichiro(?:/mobile|)|IconSurf|IlTrovatore(?:-Setaccio|)|InfuzApp|Innovazion Crawler|InternetArchive|IP2[a-z]{1,30}Bot|jbot\b|KaloogaBot|Kraken|Kurzor|larbin|LEIA|LesnikBot|Linguee Bot|LinkAider|LinkedInBot|Lite Bot|Llaut|lycos|Mail\.RU_Bot|masscan|masidani_bot|Mediapartners-Google|Microsoft .{0,30} Bot|mogimogi|mozDex|MJ12bot|msnbot(?:-media {0,2}|)|msrbot|Mtps Feed Aggregation System|netresearch|Netvibes|NewsGator[^/]{0,30}|^NING|Nutch[^/]{0,30}|Nymesis|ObjectsSearch|OgScrper|Orbiter|OOZBOT|PagePeeker|PagesInventory|PaxleFramework|Peeplo Screenshot Bot|PlantyNet_WebRobot|Pompos|Qwantify|Read%20Later|Reaper|RedCarpet|Retreiver|Riddler|Rival IQ|scooter|Scrapy|Scrubby|searchsight|seekbot|semanticdiscovery|SemrushBot|Simpy|SimplePie|SEOstats|SimpleRSS|SiteCon|Slackbot-LinkExpanding|Slack-ImgProxy|Slurp|snappy|Speedy Spider|Squrl Java|Stringer|TheUsefulbot|ThumbShotsBot|Thumbshots\.ru|Tiny Tiny RSS|Twitterbot|WhatsApp|URL2PNG|Vagabondo|VoilaBot|^vortex|Votay bot|^voyager|WASALive.Bot|Web-sniffer|WebThumb|WeSEE:[A-z]{1,30}|WhatWeb|WIRE|WordPress|Wotbox|www\.almaden\.ibm\.com|Xenu(?:.s|) Link Sleuth|Xerka [A-z]{1,30}Bot|yacy(?:bot|)|YahooSeeker|Yahoo! Slurp|Yandex\w{1,30}|YodaoBot(?:-[A-z]{1,30}|)|YottaaMonitor|Yowedo|^Zao|^Zao-Crawler|ZeBot_www\.ze\.bz|ZooShot|ZyBorg)(?:[ /]v?(\d+)(?:\.(\d+)(?:\.(\d+)|)|)|)
+```
+
+# Configuration Samples
+
+## Allow Requests That Would Otherwise Hit the Crawler Rules
+```yaml
+allow:
+- ".*Go-http-client.*"
+```
+
+Without this configuration, requests sent by Golang's default web library are treated as crawler traffic and denied.
+
+
+## Add Crawler Detection Rules
+```yaml
+deny:
+- "spd-tools.*"
+```
+
+According to this configuration, the following requests will be denied:
+
+```bash
+curl http://example.com -H 'User-Agent: spd-tools/1.1'
+curl http://example.com -H 'User-Agent: spd-tools'
+```
+
+## Only Enabled for Specific Routes or Domains
+```yaml
+# Use _rules_ field for fine-grained rule configurations 
+_rules_:
+# Rule 1: Match by route name
+- _match_route_:
+  - route-a
+  - route-b
+# Rule 2: Match by domain
+- _match_domain_:
+  - "*.example.com"
+  - test.com
+  allow:
+  - ".*Go-http-client.*"
+```
+In the `_match_route_` rule sample, `route-a` and `route-b` are the route names provided when creating a new gateway route. When the current route name matches one of them, the configuration in that rule is applied.
+In the `_match_domain_` rule sample, `*.example.com` and `test.com` are the domain names used for request matching. When the current domain name matches one of them, the configuration in that rule is applied.
+Rules are checked in the order in which they appear in the `_rules_` field. The first rule that matches is applied, and all remaining rules are ignored.
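+
+To make the first-match behavior concrete, here is a hedged sketch in which both rules carry their own settings. The route name, domain, and patterns are illustrative values only:
+
+```yaml
+_rules_:
+# Checked first: applies to requests on route-a
+- _match_route_:
+  - route-a
+  deny:
+  - "spd-tools.*"
+# Checked only if the rule above did not match
+- _match_domain_:
+  - "*.example.com"
+  allow:
+  - ".*Go-http-client.*"
+```
+
+Under the ordering described above, a request on `route-a` whose host also matches `*.example.com` would get only the first rule's `deny` setting. The second rule's `allow` would not be applied to it.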
--- a/plugins/wasm-cpp/extensions/custom_response/README_EN.md
+++ b/plugins/wasm-cpp/extensions/custom_response/README_EN.md
@@ -59,7 +59,7 @@ So based on the 302 redirecting mechanism provided by browsers, this can redirec

 If you'd like to send other responses when rate-limited, please add other fields into the configuration, referring to the Mock Responses scenario.

-## Only Enable for Specific Routes or Domains
+## Only Enabled for Specific Routes or Domains
 ```yaml
 # Use _rules_ field for fine-grained rule configurations 
 _rules_: