19 Commits

Author SHA1 Message Date
程序员阿江-Relakkes
13b6140f22 Merge pull request #831 from ouzhuowei/fix_redis_and_proxy
Handle Redis instances without the KEYS command and KuaiDaili proxies without username/password
2026-02-13 21:18:26 +08:00
ouzhuowei
279c293147 Remove unnecessary comments
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-13 09:54:10 +08:00
ouzhuowei
db47d0e6f4 Handle Redis instances without the KEYS command and KuaiDaili proxies without username/password
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-13 09:42:15 +08:00
程序员阿江(Relakkes)
d614ccf247 docs: translate comments and metadata to English
Update Chinese comments, variable descriptions, and metadata across
multiple configuration and core files to English. This improves
codebase accessibility for international developers. Additionally,
removed the sponsorship section from README files.
2026-02-12 05:30:11 +08:00
程序员阿江-Relakkes
257743b016 Merge pull request #828 from ouzhuowei/add_save_data_path
Add command-line args for proxy configuration
2026-02-12 04:47:25 +08:00
程序员阿江-Relakkes
dcaa11eeb9 Merge pull request #829 from ouzhuowei/update_sub_comment_error
Handle sub-comment fetch failures so they no longer interrupt the whole crawl flow
2026-02-12 04:46:34 +08:00
ouzhuowei
e54463ac78 Handle sub-comment fetch failures so they no longer interrupt the whole crawl flow
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-10 17:53:30 +08:00
ouzhuowei
212276bc30 Revert "Add log storage logic"
This reverts commit 30cf16af0c.

Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-10 15:03:40 +08:00
ouzhuowei
30cf16af0c Add log storage logic
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-06 12:33:35 +08:00
ouzhuowei
80e9c866a0 Merge branch 'add_save_data_path' into add_log_config
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-06 12:24:57 +08:00
ouzhuowei
90280a261a Add command-line args for proxy configuration
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-06 09:58:37 +08:00
程序员阿江-Relakkes
4ad065ce9a Merge pull request #825 from ouzhuowei/add_save_data_path
Add a data save path option; when not specified, data is saved to the data folder by default
2026-02-04 18:03:22 +08:00
ouzhuowei
2a0d1fd69f Adapt media storage file paths for each platform
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-04 09:48:39 +08:00
程序员阿江(Relakkes)
c309871485 refactor(xhs): improve login state check logic 2026-02-03 20:49:46 +08:00
程序员阿江(Relakkes)
6625663bde feat: #823 2026-02-03 20:40:15 +08:00
程序员阿江(Relakkes)
fb42ab5b60 fix: #826 2026-02-03 20:35:33 +08:00
ouzhuowei
7484156f02 Add a data save path option; when not specified, data is saved to the data folder by default
Co-Authored-By: ouzhuowei <190020754@qq.com>
2026-02-03 11:24:22 +08:00
程序员阿江(Relakkes)
413b5d9034 docs: fix README heading levels, sync Pro section across languages
- Fix h3→h2 for standalone sections (交流群组, 赞助商展示, 成为赞助者, 其他) in README.md
- Remove WebUI standalone heading (kept as collapsible only)
- Remove WandouHTTP sponsor from EN/ES versions
- Expand Pro section (remove <details> collapse) in EN/ES to match CN
- Add Content Deconstruction Agent to Pro feature list in EN/ES

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 00:40:27 +08:00
程序员阿江(Relakkes)
dbbc2c7439 docs: update README.md 2026-02-02 20:25:51 +08:00
32 changed files with 361 additions and 284 deletions


@@ -1,19 +1,5 @@
# 🔥 MediaCrawler - 自媒体平台爬虫 🕷️
<div align="center" markdown="1">
<sup>Special thanks to:</sup>
<br>
<br>
<a href="https://go.warp.dev/MediaCrawler">
<img alt="Warp sponsorship" width="400" src="https://github.com/warpdotdev/brand-assets/blob/main/Github/Sponsor/Warp-Github-LG-02.png?raw=true">
</a>
### [Warp is built for coding with multiple AI agents](https://go.warp.dev/MediaCrawler)
</div>
<hr>
<div align="center">
<a href="https://trendshift.io/repositories/8291" target="_blank">
@@ -67,14 +53,14 @@
<details>
<summary>🚀 <strong>MediaCrawlerPro 重磅发布!开源不易,欢迎订阅支持</strong></summary>
<strong>MediaCrawlerPro 重磅发布!开源不易,欢迎订阅支持</strong>
> 专注于学习成熟项目的架构设计不仅仅是爬虫技术Pro 版本的代码设计思路同样值得深入学习!
[MediaCrawlerPro](https://github.com/MediaCrawlerPro) 相较于开源版本的核心优势:
#### 🎯 核心功能升级
-**自媒体内容拆解Agent**(新增功能)
-**断点续爬功能**(重点特性)
-**多账号 + IP代理池支持**(重点特性)
-**去除 Playwright 依赖**,使用更简单
@@ -88,11 +74,10 @@
#### 🎁 额外功能
-**自媒体视频下载器桌面端**(适合学习全栈开发)
-**多平台首页信息流推荐**HomeFeed
- [ ] **基于自媒体平台的AI Agent正在开发中 🚀🚀**
- [ ] **基于评论分析AI Agent正在开发中 🚀🚀**
点击查看:[MediaCrawlerPro 项目主页](https://github.com/MediaCrawlerPro) 更多介绍
</details>
## 🚀 快速开始
@@ -150,8 +135,6 @@ uv run main.py --platform xhs --lt qrcode --type detail
uv run main.py --help
```
## WebUI支持
<details>
<summary>🖥️ <strong>WebUI 可视化操作界面</strong></summary>
@@ -247,20 +230,12 @@ MediaCrawler 支持多种数据存储方式,包括 CSV、JSON、Excel、SQLite
[🚀 MediaCrawlerPro 重磅发布 🚀!更多的功能,更好的架构设计!开源不易,欢迎订阅支持!](https://github.com/MediaCrawlerPro)
### 💬 交流群组
## 💬 交流群组
- **微信交流群**[点击加入](https://nanmicoder.github.io/MediaCrawler/%E5%BE%AE%E4%BF%A1%E4%BA%A4%E6%B5%81%E7%BE%A4.html)
- **B站账号**[关注我](https://space.bilibili.com/434377496)分享AI与爬虫技术知识
### 💰 赞助商展示
<a href="https://h.wandouip.com">
<img src="docs/static/images/img_8.jpg">
<br>
豌豆HTTP自营千万级IP资源池IP纯净度≥99.8%每日保持IP高频更新快速响应稳定连接,满足多种业务场景支持按需定制注册免费提取10000ip。
</a>
---
## 💰 赞助商展示
<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img width="500" src="docs/static/images/tikhub_banner_zh.png">
@@ -279,7 +254,7 @@ Thordata可靠且经济高效的代理服务提供商。为企业和开发者
<a href="https://www.thordata.com/products/residential-proxies/?ls=github&lk=mediacrawler">【住宅代理】</a> | <a href="https://www.thordata.com/products/web-scraper/?ls=github&lk=mediacrawler">【serp-api】</a>
### 🤝 成为赞助者
## 🤝 成为赞助者
成为赞助者,可以将您的产品展示在这里,每天获得大量曝光!
@@ -288,7 +263,7 @@ Thordata可靠且经济高效的代理服务提供商。为企业和开发者
- 邮箱:`relakkes@gmail.com`
---
### 📚 其他
## 📚 其他
- **常见问题**[MediaCrawler 完整文档](https://nanmicoder.github.io/MediaCrawler/)
- **爬虫入门教程**[CrawlerTutorial 免费教程](https://github.com/NanmiCoder/CrawlerTutorial)
- **新闻爬虫开源项目**[NewsCrawlerCollection](https://github.com/NanmiCoder/NewsCrawlerCollection)


@@ -1,16 +1,3 @@
<div align="center" markdown="1">
<sup>Special thanks to:</sup>
<br>
<br>
<a href="https://go.warp.dev/MediaCrawler">
<img alt="Warp sponsorship" width="400" src="https://github.com/warpdotdev/brand-assets/blob/main/Github/Sponsor/Warp-Github-LG-02.png?raw=true">
</a>
### [Warp is built for coding with multiple AI agents](https://go.warp.dev/MediaCrawler)
</div>
<hr>
# 🔥 MediaCrawler - Social Media Platform Crawler 🕷️
<div align="center">
@@ -60,16 +47,14 @@ A powerful **multi-platform social media data collection tool** that supports cr
| Zhihu | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
<details id="pro-version">
<summary>🔗 <strong>🚀 MediaCrawlerPro Major Release! More features, better architectural design!</strong></summary>
### 🚀 MediaCrawlerPro Major Release!
<strong>MediaCrawlerPro Major Release! Open source is not easy, welcome to subscribe and support!</strong>
> Focus on learning mature project architectural design, not just crawling technology. The code design philosophy of the Pro version is equally worth in-depth study!
[MediaCrawlerPro](https://github.com/MediaCrawlerPro) core advantages over the open-source version:
#### 🎯 Core Feature Upgrades
-**Content Deconstruction Agent** (New feature)
-**Resume crawling functionality** (Key feature)
-**Multi-account + IP proxy pool support** (Key feature)
-**Remove Playwright dependency**, easier to use
@@ -83,10 +68,9 @@ A powerful **multi-platform social media data collection tool** that supports cr
#### 🎁 Additional Features
-**Social media video downloader desktop app** (suitable for learning full-stack development)
-**Multi-platform homepage feed recommendations** (HomeFeed)
- [ ] **AI Agent based on social media platforms is under development 🚀🚀**
- [ ] **AI Agent based on comment analysis is under development 🚀🚀**
Click to view: [MediaCrawlerPro Project Homepage](https://github.com/MediaCrawlerPro) for more information
</details>
## 🚀 Quick Start
@@ -252,14 +236,6 @@ MediaCrawler supports multiple data storage methods, including CSV, JSON, Excel,
### 💰 Sponsor Display
<a href="https://h.wandouip.com">
<img src="docs/static/images/img_8.jpg">
<br>
WandouHTTP - Self-operated tens of millions IP resource pool, IP purity ≥99.8%, daily high-frequency IP updates, fast response, stable connection, supports multiple business scenarios, customizable on demand, register to get 10000 free IPs.
</a>
---
<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img width="500" src="docs/static/images/tikhub_banner_zh.png">
<br>


@@ -1,17 +1,3 @@
<div align="center" markdown="1">
<sup>Special thanks to:</sup>
<br>
<br>
<a href="https://go.warp.dev/MediaCrawler">
<img alt="Warp sponsorship" width="400" src="https://github.com/warpdotdev/brand-assets/blob/main/Github/Sponsor/Warp-Github-LG-02.png?raw=true">
</a>
### [Warp is built for coding with multiple AI agents](https://go.warp.dev/MediaCrawler)
</div>
<hr>
# 🔥 MediaCrawler - Rastreador de Plataformas de Redes Sociales 🕷️
<div align="center">
@@ -61,16 +47,14 @@ Una poderosa **herramienta de recolección de datos de redes sociales multiplata
| Zhihu | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
<details id="pro-version">
<summary>🔗 <strong>🚀 ¡Lanzamiento Mayor de MediaCrawlerPro! ¡Más características, mejor diseño arquitectónico!</strong></summary>
### 🚀 ¡Lanzamiento Mayor de MediaCrawlerPro!
<strong>¡Lanzamiento Mayor de MediaCrawlerPro! ¡El código abierto no es fácil, bienvenido a suscribirse y apoyar!</strong>
> Enfócate en aprender el diseño arquitectónico de proyectos maduros, no solo tecnología de rastreo. ¡La filosofía de diseño de código de la versión Pro también vale la pena estudiar en profundidad!
[MediaCrawlerPro](https://github.com/MediaCrawlerPro) ventajas principales sobre la versión de código abierto:
#### 🎯 Actualizaciones de Características Principales
-**Agente de Deconstrucción de Contenido** (Nueva función)
-**Funcionalidad de reanudación de rastreo** (Característica clave)
-**Soporte de múltiples cuentas + pool de proxy IP** (Característica clave)
-**Eliminar dependencia de Playwright**, más fácil de usar
@@ -84,10 +68,9 @@ Una poderosa **herramienta de recolección de datos de redes sociales multiplata
#### 🎁 Características Adicionales
-**Aplicación de escritorio descargadora de videos de redes sociales** (adecuada para aprender desarrollo full-stack)
-**Recomendaciones de feed de página de inicio multiplataforma** (HomeFeed)
- [ ] **Agente AI basado en plataformas de redes sociales está en desarrollo 🚀🚀**
- [ ] **Agente AI basado en análisis de comentarios está en desarrollo 🚀🚀**
Haga clic para ver: [Página de Inicio del Proyecto MediaCrawlerPro](https://github.com/MediaCrawlerPro) para más información
</details>
## 🚀 Inicio Rápido
@@ -253,14 +236,6 @@ MediaCrawler soporta múltiples métodos de almacenamiento de datos, incluyendo
### 💰 Exhibición de Patrocinadores
<a href="https://h.wandouip.com">
<img src="docs/static/images/img_8.jpg">
<br>
WandouHTTP - Pool de recursos IP auto-operado de decenas de millones, pureza de IP ≥99.8%, actualizaciones de IP de alta frecuencia diarias, respuesta rápida, conexión estable, soporta múltiples escenarios de negocio, personalizable según demanda, regístrese para obtener 10000 IPs gratis.
</a>
---
<a href="https://tikhub.io/?utm_source=github.com/NanmiCoder/MediaCrawler&utm_medium=marketing_social&utm_campaign=retargeting&utm_content=carousel_ad">
<img width="500" src="docs/static/images/tikhub_banner_zh.png">
<br>

cache/redis_cache.py

@@ -28,6 +28,7 @@ import time
from typing import Any, List
from redis import Redis
from redis.exceptions import ResponseError
from cache.abs_cache import AbstractCache
from config import db_config
@@ -76,8 +77,25 @@ class RedisCache(AbstractCache):
def keys(self, pattern: str) -> List[str]:
"""
Get all keys matching the pattern
First try KEYS command, if not supported fallback to SCAN
"""
return [key.decode() for key in self._redis_client.keys(pattern)]
try:
# Try KEYS command first (faster for standard Redis)
return [key.decode() if isinstance(key, bytes) else key for key in self._redis_client.keys(pattern)]
except ResponseError as e:
# If KEYS is not supported (e.g., Redis Cluster or cloud Redis), use SCAN
if "unknown command" in str(e).lower() or "keys" in str(e).lower():
keys_list: List[str] = []
cursor = 0
while True:
cursor, keys = self._redis_client.scan(cursor=cursor, match=pattern, count=100)
keys_list.extend([key.decode() if isinstance(key, bytes) else key for key in keys])
if cursor == 0:
break
return keys_list
else:
# Re-raise if it's a different error
raise
if __name__ == '__main__':
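
For context, the fallback above matters on managed or cluster Redis deployments where the KEYS command is disabled. A standalone sketch of the same pattern, assuming a reachable Redis instance and the redis-py package (connection settings and the key pattern are placeholders):

```python
import redis
from redis.exceptions import ResponseError

# Placeholder connection settings for illustration only.
client = redis.Redis(host="127.0.0.1", port=6379, db=0)

def match_keys(pattern: str) -> list:
    """Return keys matching pattern, falling back to SCAN when KEYS is rejected."""
    try:
        return [k.decode() if isinstance(k, bytes) else k for k in client.keys(pattern)]
    except ResponseError:
        # SCAN walks the keyspace incrementally and is usually allowed where KEYS is not.
        found, cursor = [], 0
        while True:
            cursor, batch = client.scan(cursor=cursor, match=pattern, count=100)
            found.extend(k.decode() if isinstance(k, bytes) else k for k in batch)
            if cursor == 0:
                return found

print(match_keys("cache:*"))  # example pattern
```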


@@ -266,12 +266,46 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
rich_help_panel="Performance Configuration",
),
] = config.MAX_CONCURRENCY_NUM,
save_data_path: Annotated[
str,
typer.Option(
"--save_data_path",
help="Data save path, default is empty and will save to data folder",
rich_help_panel="Storage Configuration",
),
] = config.SAVE_DATA_PATH,
enable_ip_proxy: Annotated[
str,
typer.Option(
"--enable_ip_proxy",
help="Whether to enable IP proxy, supports yes/true/t/y/1 or no/false/f/n/0",
rich_help_panel="Proxy Configuration",
show_default=True,
),
] = str(config.ENABLE_IP_PROXY),
ip_proxy_pool_count: Annotated[
int,
typer.Option(
"--ip_proxy_pool_count",
help="IP proxy pool count",
rich_help_panel="Proxy Configuration",
),
] = config.IP_PROXY_POOL_COUNT,
ip_proxy_provider_name: Annotated[
str,
typer.Option(
"--ip_proxy_provider_name",
help="IP proxy provider name (kuaidaili | wandouhttp)",
rich_help_panel="Proxy Configuration",
),
] = config.IP_PROXY_PROVIDER_NAME,
) -> SimpleNamespace:
"""MediaCrawler 命令行入口"""
enable_comment = _to_bool(get_comment)
enable_sub_comment = _to_bool(get_sub_comment)
enable_headless = _to_bool(headless)
enable_ip_proxy_value = _to_bool(enable_ip_proxy)
init_db_value = init_db.value if init_db else None
# Parse specified_id and creator_id into lists
@@ -292,6 +326,10 @@ async def parse_cmd(argv: Optional[Sequence[str]] = None):
config.COOKIES = cookies
config.CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES = max_comments_count_singlenotes
config.MAX_CONCURRENCY_NUM = max_concurrency_num
config.SAVE_DATA_PATH = save_data_path
config.ENABLE_IP_PROXY = enable_ip_proxy_value
config.IP_PROXY_POOL_COUNT = ip_proxy_pool_count
config.IP_PROXY_PROVIDER_NAME = ip_proxy_provider_name
# Set platform-specific ID lists for detail/creator mode
if specified_id_list:
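
The new string-typed proxy flag is funnelled through the same `_to_bool` helper as the other boolean options. Its implementation is not shown here, but based on the help text ("yes/true/t/y/1 or no/false/f/n/0") it presumably behaves like the sketch below; the name `to_bool` and the error handling are assumptions.

```python
def to_bool(value: str) -> bool:
    """Interpret CLI-style boolean strings such as the value of --enable_ip_proxy."""
    normalized = str(value).strip().lower()
    if normalized in {"yes", "true", "t", "y", "1"}:
        return True
    if normalized in {"no", "false", "f", "n", "0"}:
        return False
    raise ValueError(f"Expected a boolean-like string, got: {value!r}")

# Example: `--enable_ip_proxy yes` would end up as config.ENABLE_IP_PROXY = True.
assert to_bool("yes") is True
assert to_bool("0") is False
```

With these options, an invocation such as `uv run main.py --platform xhs --enable_ip_proxy yes --ip_proxy_provider_name kuaidaili` would be expected to turn the proxy on without editing the config files.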


@@ -17,104 +17,107 @@
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# 基础配置
PLATFORM = "xhs" # 平台,xhs | dy | ks | bili | wb | tieba | zhihu
KEYWORDS = "编程副业,编程兼职" # 关键词搜索配置,以英文逗号分隔
# Basic configuration
PLATFORM = "xhs" # Platform, xhs | dy | ks | bili | wb | tieba | zhihu
KEYWORDS = "编程副业,编程兼职" # Keyword search configuration, separated by English commas
LOGIN_TYPE = "qrcode" # qrcode or phone or cookie
COOKIES = ""
CRAWLER_TYPE = (
"search" # 爬取类型search(关键词搜索) | detail(帖子详情)| creator(创作者主页数据)
"search" # Crawling type, search (keyword search) | detail (post details) | creator (creator homepage data)
)
# 是否开启 IP 代理
# Whether to enable IP proxy
ENABLE_IP_PROXY = False
# 代理IP池数量
# Number of proxy IP pools
IP_PROXY_POOL_COUNT = 2
# 代理IP提供商名称
# Proxy IP provider name
IP_PROXY_PROVIDER_NAME = "kuaidaili" # kuaidaili | wandouhttp
# 设置为True不会打开浏览器无头浏览器
# 设置False会打开一个浏览器
# 小红书如果一直扫码登录不通过,打开浏览器手动过一下滑动验证码
# 抖音如果一直提示失败,打开浏览器看下是否扫码登录之后出现了手机号验证,如果出现了手动过一下再试。
# Setting to True will not open the browser (headless browser)
# Setting False will open a browser
# If Xiaohongshu keeps scanning the code to log in but fails, open the browser and manually pass the sliding verification code.
# If Douyin keeps prompting failure, open the browser and see if mobile phone number verification appears after scanning the QR code to log in. If it does, manually go through it and try again.
HEADLESS = False
# 是否保存登录状态
# Whether to save login status
SAVE_LOGIN_STATE = True
# ==================== CDP (Chrome DevTools Protocol) 配置 ====================
# 是否启用CDP模式 - 使用用户现有的Chrome/Edge浏览器进行爬取提供更好的反检测能力
# 启用后将自动检测并启动用户的Chrome/Edge浏览器通过CDP协议进行控制
# 这种方式使用真实的浏览器环境包括用户的扩展、Cookie和设置大大降低被检测的风险
# ==================== CDP (Chrome DevTools Protocol) Configuration ====================
# Whether to enable CDP mode - use the user's existing Chrome/Edge browser to crawl, providing better anti-detection capabilities
# Once enabled, the user's Chrome/Edge browser will be automatically detected and started, and controlled through the CDP protocol.
# This method uses the real browser environment, including the user's extensions, cookies and settings, greatly reducing the risk of detection.
ENABLE_CDP_MODE = True
# CDP调试端口,用于与浏览器通信
# 如果端口被占用,系统会自动尝试下一个可用端口
# CDP debug port, used to communicate with the browser
# If the port is occupied, the system will automatically try the next available port
CDP_DEBUG_PORT = 9222
# 自定义浏览器路径(可选)
# 如果为空,系统会自动检测Chrome/Edge的安装路径
# Windows示例: "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
# macOS示例: "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
# Custom browser path (optional)
# If it is empty, the system will automatically detect the installation path of Chrome/Edge
# Windows example: "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
# macOS example: "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
CUSTOM_BROWSER_PATH = ""
# CDP模式下是否启用无头模式
# 注意即使设置为True某些反检测功能在无头模式下可能效果不佳
# Whether to enable headless mode in CDP mode
# NOTE: Even if set to True, some anti-detection features may not work well in headless mode
CDP_HEADLESS = False
# 浏览器启动超时时间(秒)
# Browser startup timeout (seconds)
BROWSER_LAUNCH_TIMEOUT = 60
# 是否在程序结束时自动关闭浏览器
# 设置为False可以保持浏览器运行便于调试
# Whether to automatically close the browser when the program ends
# Set to False to keep the browser running for easy debugging
AUTO_CLOSE_BROWSER = True
# 数据保存类型选项配置,支持六种类型:csv、db、json、sqlite、excel、postgres,最好保存到DB,有排重的功能。
# Data saving type option configuration, supports six types: csv, db, json, sqlite, excel, postgres. It is best to save to DB, with deduplication function.
SAVE_DATA_OPTION = "json" # csv or db or json or sqlite or excel or postgres
# 用户浏览器缓存的浏览器文件配置
# Data save path; if left empty, data is saved to the data folder by default.
SAVE_DATA_PATH = ""
# Browser file configuration cached by the user's browser
USER_DATA_DIR = "%s_user_data_dir" # %s will be replaced by platform name
# 爬取开始页数 默认从第一页开始
# Page number to start crawling from; defaults to the first page
START_PAGE = 1
# 爬取视频/帖子的数量控制
# Control the number of crawled videos/posts
CRAWLER_MAX_NOTES_COUNT = 15
# 并发爬虫数量控制
# Controlling the number of concurrent crawlers
MAX_CONCURRENCY_NUM = 1
# 是否开启爬媒体模式(包含图片或视频资源),默认不开启爬媒体
# Whether to enable crawling media mode (including image or video resources), crawling media is not enabled by default
ENABLE_GET_MEIDAS = False
# 是否开启爬评论模式, 默认开启爬评论
# Whether to enable comment crawling mode. Comment crawling is enabled by default.
ENABLE_GET_COMMENTS = True
# 爬取一级评论的数量控制(单视频/帖子)
# Control the number of crawled first-level comments (single video/post)
CRAWLER_MAX_COMMENTS_COUNT_SINGLENOTES = 10
# 是否开启爬二级评论模式, 默认不开启爬二级评论
# 老版本项目使用了 db, 则需参考 schema/tables.sql line 287 增加表字段
# Whether to enable the mode of crawling second-level comments. By default, crawling of second-level comments is not enabled.
# If the old version of the project uses db, you need to refer to schema/tables.sql line 287 to add table fields.
ENABLE_GET_SUB_COMMENTS = False
# 词云相关
# 是否开启生成评论词云图
# word cloud related
# Whether to enable generating comment word clouds
ENABLE_GET_WORDCLOUD = False
# 自定义词语及其分组
# 添加规则xx:yy 其中xx为自定义添加的词组yy为将xx该词组分到的组名。
# Custom words and their groups
# Add rule: xx:yy where xx is a custom-added phrase, and yy is the group name to which the phrase xx is assigned.
CUSTOM_WORDS = {
"零几": "年份", # 将“零几”识别为一个整体
"高频词": "专业术语", # 示例自定义词
"零几": "年份", # Recognize "zero points" as a whole
"高频词": "专业术语", # Example custom words
}
# 停用(禁用)词文件路径
# Stop words (blocked words) file path
STOP_WORDS_FILE = "./docs/hit_stopwords.txt"
# 中文字体文件路径
# Chinese font file path
FONT_PATH = "./docs/STZHONGS.TTF"
# 爬取间隔时间
# Crawl interval
CRAWLER_MAX_SLEEP_SEC = 2
from .bilibili_config import *
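
The new SAVE_DATA_PATH setting is consumed by the storage layer as a plain base-path override, as the store diffs further below show. A condensed sketch of that resolution, assuming the project's `config` package is importable (the helper name is hypothetical; the real code inlines this check in each store class):

```python
from pathlib import Path

import config  # the project's config package


def resolve_store_dir(platform: str, sub_dir: str) -> Path:
    """Return e.g. <SAVE_DATA_PATH>/xhs/images, or data/xhs/images when unset."""
    base = config.SAVE_DATA_PATH if config.SAVE_DATA_PATH else "data"
    target = Path(base) / platform / sub_dir
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Passing `--save_data_path` on the command line (added in the cmd-arg diff above) overrides this config default.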


@@ -16,15 +16,15 @@
#
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# bilili 平台配置
# bilili platform configuration
# 每天爬取视频/帖子的数量控制
# Control the number of videos/posts crawled per day
MAX_NOTES_PER_DAY = 1
# 指定B站视频URL列表 (支持完整URL或BV号)
# 示例:
# - 完整URL: "https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click"
# - BV: "BV1d54y1g7db"
# Specify Bilibili video URL list (supports complete URL or BV number)
# Example:
# - Full URL: "https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click"
# - BV number: "BV1d54y1g7db"
BILI_SPECIFIED_ID_LIST = [
"https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click",
"BV1Sz4y1U77N",
@@ -32,9 +32,9 @@ BILI_SPECIFIED_ID_LIST = [
# ........................
]
# 指定B站创作者URL列表 (支持完整URL或UID)
# 示例:
# - 完整URL: "https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0"
# Specify the URL list of Bilibili creators (supports full URL or UID)
# Example:
# - Full URL: "https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0"
# - UID: "20813884"
BILI_CREATOR_ID_LIST = [
"https://space.bilibili.com/434377496?spm_id_from=333.1007.0.0",
@@ -42,26 +42,26 @@ BILI_CREATOR_ID_LIST = [
# ........................
]
# 指定时间范围
# Specify time range
START_DAY = "2024-01-01"
END_DAY = "2024-01-01"
# 搜索模式
# Search mode
BILI_SEARCH_MODE = "normal"
# 视频清晰度qn配置常见取值
# 16=360p, 32=480p, 64=720p, 80=1080p, 112=1080p高码率, 116=1080p60, 120=4K
# 注意:更高清晰度需要账号/视频本身支持
# Video definition (qn) configuration, common values:
# 16=360p, 32=480p, 64=720p, 80=1080p, 112=1080p high bit rate, 116=1080p60, 120=4K
# Note: Higher definition requires account/video support
BILI_QN = 80
# 是否爬取用户信息
# Whether to crawl user information
CREATOR_MODE = True
# 开始爬取用户信息页码
# Start crawling user information page number
START_CONTACTS_PAGE = 1
# 单个视频/帖子最大爬取评论数
# Maximum number of crawled comments for a single video/post
CRAWLER_MAX_CONTACTS_COUNT_SINGLENOTES = 100
# 单个视频/帖子最大爬取动态数
# Maximum number of crawled dynamics for a single video/post
CRAWLER_MAX_DYNAMICS_COUNT_SINGLENOTES = 50
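
BILI_SPECIFIED_ID_LIST accepts either full video URLs or bare BV ids. A hypothetical helper illustrating how such mixed input can be normalised, assuming the usual `BV` + 10 alphanumeric characters format (the project ships its own parsing utilities, so this is illustration only):

```python
import re

BV_PATTERN = re.compile(r"BV[0-9A-Za-z]{10}")

def normalize_bili_video_id(url_or_bv: str) -> str:
    """Return the BV id whether given a full bilibili.com URL or a bare BV string."""
    match = BV_PATTERN.search(url_or_bv)
    if not match:
        raise ValueError(f"No BV id found in: {url_or_bv}")
    return match.group(0)

print(normalize_bili_video_id("https://www.bilibili.com/video/BV1dwuKzmE26/?spm_id_from=333.1387.homepage.video_card.click"))
print(normalize_bili_video_id("BV1Sz4y1U77N"))
```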


@@ -17,16 +17,16 @@
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# 抖音平台配置
# Douyin platform configuration
PUBLISH_TIME_TYPE = 0
# 指定DY视频URL列表 (支持多种格式)
# 支持格式:
# 1. 完整视频URL: "https://www.douyin.com/video/7525538910311632128"
# 2. modal_id的URL: "https://www.douyin.com/user/xxx?modal_id=7525538910311632128"
# 3. 搜索页带modal_id: "https://www.douyin.com/root/search/python?modal_id=7525538910311632128"
# 4. 短链接: "https://v.douyin.com/drIPtQ_WPWY/"
# 5. 纯视频ID: "7280854932641664319"
# Specify DY video URL list (supports multiple formats)
# Supported formats:
# 1. Full video URL: "https://www.douyin.com/video/7525538910311632128"
# 2. URL with modal_id: "https://www.douyin.com/user/xxx?modal_id=7525538910311632128"
# 3. The search page has modal_id: "https://www.douyin.com/root/search/python?modal_id=7525538910311632128"
# 4. Short link: "https://v.douyin.com/drIPtQ_WPWY/"
# 5. Pure video ID: "7280854932641664319"
DY_SPECIFIED_ID_LIST = [
"https://www.douyin.com/video/7525538910311632128",
"https://v.douyin.com/drIPtQ_WPWY/",
@@ -35,9 +35,9 @@ DY_SPECIFIED_ID_LIST = [
# ........................
]
# 指定DY创作者URL列表 (支持完整URL或sec_user_id)
# 支持格式:
# 1. 完整创作者主页URL: "https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main"
# Specify DY creator URL list (supports full URL or sec_user_id)
# Supported formats:
# 1. Complete creator homepage URL: "https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main"
# 2. sec_user_id: "MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE"
DY_CREATOR_ID_LIST = [
"https://www.douyin.com/user/MS4wLjABAAAATJPY7LAlaa5X-c8uNdWkvz0jUGgpw4eeXIwu_8BhvqE?from_tab_name=main",


@@ -17,22 +17,22 @@
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# 快手平台配置
# Kuaishou platform configuration
# 指定快手视频URL列表 (支持完整URL或纯ID)
# 支持格式:
# 1. 完整视频URL: "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search"
# 2. 纯视频ID: "3xf8enb8dbj6uig"
# Specify Kuaishou video URL list (supports complete URL or pure ID)
# Supported formats:
# 1. Full video URL: "https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search"
# 2. Pure video ID: "3xf8enb8dbj6uig"
KS_SPECIFIED_ID_LIST = [
"https://www.kuaishou.com/short-video/3x3zxz4mjrsc8ke?authorId=3x84qugg4ch9zhs&streamSource=search&area=searchxxnull&searchKey=python",
"3xf8enb8dbj6uig",
# ........................
]
# 指定快手创作者URL列表 (支持完整URL或纯ID)
# 支持格式:
# 1. 创作者主页URL: "https://www.kuaishou.com/profile/3x84qugg4ch9zhs"
# 2. user_id: "3x4sm73aye7jq7i"
# Specify Kuaishou creator URL list (supports full URL or pure ID)
# Supported formats:
# 1. Creator homepage URL: "https://www.kuaishou.com/profile/3x84qugg4ch9zhs"
# 2. Pure user_id: "3x4sm73aye7jq7i"
KS_CREATOR_ID_LIST = [
"https://www.kuaishou.com/profile/3x84qugg4ch9zhs",
"3x4sm73aye7jq7i",


@@ -17,17 +17,17 @@
# 详细许可条款请参阅项目根目录下的LICENSE文件。
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# 贴吧平台配置
# Tieba platform configuration
# 指定贴吧ID列表
# Specify Tieba ID list
TIEBA_SPECIFIED_ID_LIST = []
# 指定贴吧名称列表
# Specify a list of Tieba names
TIEBA_NAME_LIST = [
# "盗墓笔记"
# "Tomb Robbery Notes"
]
# 指定贴吧用户URL列表
# Specify Tieba user URL list
TIEBA_CREATOR_URL_LIST = [
"https://tieba.baidu.com/home/main/?id=tb.1.7f139e2e.6CyEwxu3VJruH_-QqpCi6g&fr=frs",
# ........................


@@ -18,23 +18,23 @@
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# 微博平台配置
# Weibo platform configuration
# 搜索类型,具体的枚举值在media_platform/weibo/field.py
# Search type, the specific enumeration value is in media_platform/weibo/field.py
WEIBO_SEARCH_TYPE = "default"
# 指定微博ID列表
# Specify Weibo ID list
WEIBO_SPECIFIED_ID_LIST = [
"4982041758140155",
# ........................
]
# 指定微博用户ID列表
# Specify Weibo user ID list
WEIBO_CREATOR_ID_LIST = [
"5756404150",
# ........................
]
# 是否开启微博爬取全文的功能,默认开启
# 如果开启的话会增加被风控的概率,相当于一个关键词搜索请求会再遍历所有帖子的时候,再请求一次帖子详情
# Whether to enable the function of crawling the full text of Weibo. It is enabled by default.
# If turned on, it will increase the probability of being risk controlled, which is equivalent to a keyword search request that will traverse all posts and request the post details again.
ENABLE_WEIBO_FULL_TEXT = True


@@ -18,18 +18,18 @@
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# 小红书平台配置
# Xiaohongshu platform configuration
# 排序方式,具体的枚举值在media_platform/xhs/field.py
# Sorting method, the specific enumeration value is in media_platform/xhs/field.py
SORT_TYPE = "popularity_descending"
# 指定笔记URL列表, 必须要携带xsec_token参数
# Specify the note URL list, which must carry the xsec_token parameter
XHS_SPECIFIED_NOTE_URL_LIST = [
"https://www.xiaohongshu.com/explore/64b95d01000000000c034587?xsec_token=AB0EFqJvINCkj6xOCKCQgfNNh8GdnBC_6XecG4QOddo3Q=&xsec_source=pc_cfeed"
# ........................
]
# 指定创作者URL列表需要携带xsec_tokenxsec_source参数
# Specify the creator URL list, which needs to carry xsec_token and xsec_source parameters.
XHS_CREATOR_ID_LIST = [
"https://www.xiaohongshu.com/user/profile/5f58bd990000000001003753?xsec_token=ABYVg1evluJZZzpMX-VWzchxQ1qSNVW3r-jOEnKqMcgZw=&xsec_source=pc_search"


@@ -18,17 +18,17 @@
# 使用本代码即表示您同意遵守上述原则和LICENSE中的所有条款。
# 知乎平台配置
# Zhihu platform configuration
# 指定知乎用户URL列表
# Specify Zhihu user URL list
ZHIHU_CREATOR_URL_LIST = [
"https://www.zhihu.com/people/yd1234567",
# ........................
]
# 指定知乎ID列表
# Specify Zhihu ID list
ZHIHU_SPECIFIED_ID_LIST = [
"https://www.zhihu.com/question/826896610/answer/4885821440", # 回答
"https://zhuanlan.zhihu.com/p/673461588", # 文章
"https://www.zhihu.com/zvideo/1539542068422144000", # 视频
"https://www.zhihu.com/question/826896610/answer/4885821440", # answer
"https://zhuanlan.zhihu.com/p/673461588", # article
"https://www.zhihu.com/zvideo/1539542068422144000", # video
]


@@ -474,7 +474,7 @@ class BilibiliCrawler(AbstractCrawler):
},
playwright_page=self.context_page,
cookie_dict=cookie_dict,
proxy_ip_pool=self.ip_proxy_pool, # 传递代理池用于自动刷新
proxy_ip_pool=self.ip_proxy_pool, # Pass proxy pool for automatic refresh
)
return bilibili_client_obj


@@ -43,7 +43,7 @@ class DouYinClient(AbstractApiClient, ProxyRefreshMixin):
def __init__(
self,
timeout=60, # 若开启爬取媒体选项,抖音的短视频需要更久的超时时间
timeout=60, # If the crawl media option is turned on, Douyins short videos will require a longer timeout.
proxy=None,
*,
headers: Dict,
@@ -57,7 +57,7 @@ class DouYinClient(AbstractApiClient, ProxyRefreshMixin):
self._host = "https://www.douyin.com"
self.playwright_page = playwright_page
self.cookie_dict = cookie_dict
# 初始化代理池(来自 ProxyRefreshMixin
# Initialize proxy pool (from ProxyRefreshMixin)
self.init_proxy_pool(proxy_ip_pool)
async def __process_req_params(
@@ -103,7 +103,7 @@ class DouYinClient(AbstractApiClient, ProxyRefreshMixin):
params.update(common_params)
query_string = urllib.parse.urlencode(params)
# 20240927 a-bogus更新JS版本
# 20240927 a-bogus update (JS version)
post_data = {}
if request_method == "POST":
post_data = params
@@ -113,7 +113,7 @@ class DouYinClient(AbstractApiClient, ProxyRefreshMixin):
params["a_bogus"] = a_bogus
async def request(self, method, url, **kwargs):
# 每次请求前检测代理是否过期
# Check whether the proxy has expired before each request
await self._refresh_proxy_if_expired()
async with httpx.AsyncClient(proxy=self.proxy) as client:
@@ -266,13 +266,13 @@ class DouYinClient(AbstractApiClient, ProxyRefreshMixin):
if len(result) + len(comments) > max_count:
comments = comments[:max_count - len(result)]
result.extend(comments)
if callback: # 如果有回调函数,就执行回调函数
if callback: # If there is a callback function, execute the callback function
await callback(aweme_id, comments)
await asyncio.sleep(crawl_interval)
if not is_fetch_sub_comments:
continue
# 获取二级评论
# Fetch second-level (sub) comments
for comment in comments:
reply_comment_total = comment.get("reply_comment_total")
@@ -290,7 +290,7 @@ class DouYinClient(AbstractApiClient, ProxyRefreshMixin):
if not sub_comments:
continue
result.extend(sub_comments)
if callback: # 如果有回调函数,就执行回调函数
if callback: # If there is a callback function, execute the callback function
await callback(aweme_id, sub_comments)
await asyncio.sleep(crawl_interval)
return result
@@ -343,7 +343,7 @@ class DouYinClient(AbstractApiClient, ProxyRefreshMixin):
else:
return response.content
except httpx.HTTPError as exc: # some wrong when call httpx.request method, such as connection error, client error, server error or response status code is not 2xx
utils.logger.error(f"[DouYinClient.get_aweme_media] {exc.__class__.__name__} for {exc.request.url} - {exc}") # 保留原始异常类型名称,以便开发者调试
utils.logger.error(f"[DouYinClient.get_aweme_media] {exc.__class__.__name__} for {exc.request.url} - {exc}") # Keep the original exception type name for developers to debug
return None
async def resolve_short_url(self, short_url: str) -> str:
@@ -359,7 +359,7 @@ class DouYinClient(AbstractApiClient, ProxyRefreshMixin):
utils.logger.info(f"[DouYinClient.resolve_short_url] Resolving short URL: {short_url}")
response = await client.get(short_url, timeout=10)
# 短链接通常返回302重定向
# Short links usually return a 302 redirect
if response.status_code in [301, 302, 303, 307, 308]:
redirect_url = response.headers.get("Location", "")
utils.logger.info(f"[DouYinClient.resolve_short_url] Resolved to: {redirect_url}")


@@ -55,7 +55,7 @@ class DouYinCrawler(AbstractCrawler):
def __init__(self) -> None:
self.index_url = "https://www.douyin.com"
self.cdp_manager = None
self.ip_proxy_pool = None # 代理IP池用于代理自动刷新
self.ip_proxy_pool = None # Proxy IP pool for automatic proxy refresh
async def start(self) -> None:
playwright_proxy_format, httpx_proxy_format = None, None
@@ -65,7 +65,7 @@ class DouYinCrawler(AbstractCrawler):
playwright_proxy_format, httpx_proxy_format = utils.format_proxy_info(ip_proxy_info)
async with async_playwright() as playwright:
# 根据配置选择启动模式
# Select startup mode based on configuration
if config.ENABLE_CDP_MODE:
utils.logger.info("[DouYinCrawler] 使用CDP模式启动浏览器")
self.browser_context = await self.launch_browser_with_cdp(
@@ -178,12 +178,12 @@ class DouYinCrawler(AbstractCrawler):
try:
video_info = parse_video_info_from_url(video_url)
# 处理短链接
# Handling short links
if video_info.url_type == "short":
utils.logger.info(f"[DouYinCrawler.get_specified_awemes] Resolving short link: {video_url}")
resolved_url = await self.dy_client.resolve_short_url(video_url)
if resolved_url:
# 从解析后的URL中提取视频ID
# Extract video ID from parsed URL
video_info = parse_video_info_from_url(resolved_url)
utils.logger.info(f"[DouYinCrawler.get_specified_awemes] Short link resolved to aweme ID: {video_info.aweme_id}")
else:
@@ -240,7 +240,7 @@ class DouYinCrawler(AbstractCrawler):
async def get_comments(self, aweme_id: str, semaphore: asyncio.Semaphore) -> None:
async with semaphore:
try:
# 将关键词列表传递给 get_aweme_all_comments 方法
# Pass the list of keywords to the get_aweme_all_comments method
# Use fixed crawling interval
crawl_interval = config.CRAWLER_MAX_SLEEP_SEC
await self.dy_client.get_aweme_all_comments(
@@ -311,7 +311,7 @@ class DouYinCrawler(AbstractCrawler):
},
playwright_page=self.context_page,
cookie_dict=cookie_dict,
proxy_ip_pool=self.ip_proxy_pool, # 传递代理池用于自动刷新
proxy_ip_pool=self.ip_proxy_pool, # Pass proxy pool for automatic refresh
)
return douyin_client
@@ -361,10 +361,10 @@ class DouYinCrawler(AbstractCrawler):
headless=headless,
)
# 添加反检测脚本
# Add anti-detection script
await self.cdp_manager.add_stealth_script()
# 显示浏览器信息
# Show browser information
browser_info = await self.cdp_manager.get_browser_info()
utils.logger.info(f"[DouYinCrawler] CDP浏览器信息: {browser_info}")
@@ -372,13 +372,13 @@ class DouYinCrawler(AbstractCrawler):
except Exception as e:
utils.logger.error(f"[DouYinCrawler] CDP模式启动失败回退到标准模式: {e}")
# 回退到标准模式
# Fall back to standard mode
chromium = playwright.chromium
return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
async def close(self) -> None:
"""Close browser context"""
# 如果使用CDP模式需要特殊处理
# If you use CDP mode, special processing is required
if self.cdp_manager:
await self.cdp_manager.cleanup()
self.cdp_manager = None
@@ -396,11 +396,11 @@ class DouYinCrawler(AbstractCrawler):
if not config.ENABLE_GET_MEIDAS:
utils.logger.info(f"[DouYinCrawler.get_aweme_media] Crawling image mode is not enabled")
return
# 笔记 urls 列表,若为短视频类型则返回为空列表
# List of note urls. If it is a short video type, an empty list will be returned.
note_download_url: List[str] = douyin_store._extract_note_image_list(aweme_item)
# 视频 url永远存在但为短视频类型时的文件其实是音频文件
# The video URL will always exist, but when it is a short video type, the file is actually an audio file.
video_download_url: str = douyin_store._extract_video_download_url(aweme_item)
# TODO: 抖音并没采用音视频分离的策略,故音频可从原视频中分离,暂不提取
# TODO: Douyin does not adopt the audio and video separation strategy, so the audio can be separated from the original video and will not be extracted for the time being.
if note_download_url:
await self.get_aweme_images(aweme_item)
else:
@@ -416,7 +416,7 @@ class DouYinCrawler(AbstractCrawler):
if not config.ENABLE_GET_MEIDAS:
return
aweme_id = aweme_item.get("aweme_id")
# 笔记 urls 列表,若为短视频类型则返回为空列表
# List of note urls. If it is a short video type, an empty list will be returned.
note_download_url: List[str] = douyin_store._extract_note_image_list(aweme_item)
if not note_download_url:
@@ -444,7 +444,7 @@ class DouYinCrawler(AbstractCrawler):
return
aweme_id = aweme_item.get("aweme_id")
# 视频 url永远存在但为短视频类型时的文件其实是音频文件
# The video URL will always exist, but when it is a short video type, the file is actually an audio file.
video_download_url: str = douyin_store._extract_video_download_url(aweme_item)
if not video_download_url:


@@ -20,7 +20,7 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Name: Programmer Ajiang-Relakkes
# @Time : 2024/6/10 02:24
# @Desc : Get a_bogus parameter, for learning and communication only, do not use for commercial purposes, contact author to delete if infringement


@@ -191,7 +191,7 @@ class DouYinLogin(AbstractLogin):
await self.move_slider(back_selector, gap_selector, move_step, slider_level)
await asyncio.sleep(1)
# If the slider is too slow or verification failed, it will prompt "操作过慢", click the refresh button here
# If the slider is too slow or verification failed, it will prompt "The operation is too slow", click the refresh button here
page_content = await self.context_page.content()
if "操作过慢" in page_content or "提示重新操作" in page_content:
utils.logger.info("[DouYinLogin.check_page_display_slider] slider verify failed, retry ...")


@@ -24,7 +24,7 @@ from urllib.parse import urlencode
import httpx
from playwright.async_api import BrowserContext, Page
from tenacity import retry, stop_after_attempt, wait_fixed
from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_not_exception_type
import config
from base.base_crawler import AbstractApiClient
@@ -34,7 +34,7 @@ from tools import utils
if TYPE_CHECKING:
from proxy.proxy_ip_pool import ProxyIpPool
from .exception import DataFetchError, IPBlockError
from .exception import DataFetchError, IPBlockError, NoteNotFoundError
from .field import SearchNoteType, SearchSortType
from .help import get_search_id
from .extractor import XiaoHongShuExtractor
@@ -60,6 +60,7 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
self._domain = "https://www.xiaohongshu.com"
self.IP_ERROR_STR = "Network connection error, please check network settings or restart"
self.IP_ERROR_CODE = 300012
self.NOTE_NOT_FOUND_CODE = -510000
self.NOTE_ABNORMAL_STR = "Note status abnormal, please check later"
self.NOTE_ABNORMAL_CODE = -510001
self.playwright_page = playwright_page
@@ -109,7 +110,7 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
self.headers.update(headers)
return self.headers
@retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
@retry(stop=stop_after_attempt(3), wait=wait_fixed(1), retry=retry_if_not_exception_type(NoteNotFoundError))
async def request(self, method, url, **kwargs) -> Union[str, Any]:
"""
Wrapper for httpx common request method, processes request response
@@ -144,6 +145,8 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
return data.get("data", data.get("success", {}))
elif data["code"] == self.IP_ERROR_CODE:
raise IPBlockError(self.IP_ERROR_STR)
elif data["code"] in (self.NOTE_NOT_FOUND_CODE, self.NOTE_ABNORMAL_CODE):
raise NoteNotFoundError(f"Note not found or abnormal, code: {data['code']}")
else:
err_msg = data.get("msg", None) or f"{response.text}"
raise DataFetchError(err_msg)
@@ -208,24 +211,38 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
) # Keep original exception type name for developer debugging
return None
async def query_self(self) -> Optional[Dict]:
"""
Query self user info to check login state
Returns:
Dict: User info if logged in, None otherwise
"""
uri = "/api/sns/web/v1/user/selfinfo"
headers = await self._pre_headers(uri, params={})
async with httpx.AsyncClient(proxy=self.proxy) as client:
response = await client.get(f"{self._host}{uri}", headers=headers)
if response.status_code == 200:
return response.json()
return None
async def pong(self) -> bool:
"""
Check if login state is still valid
Check if login state is still valid by querying self user info
Returns:
bool: True if logged in, False otherwise
"""
"""get a note to check if login state is ok"""
utils.logger.info("[XiaoHongShuClient.pong] Begin to pong xhs...")
utils.logger.info("[XiaoHongShuClient.pong] Begin to check login state...")
ping_flag = False
try:
note_card: Dict = await self.get_note_by_keyword(keyword="Xiaohongshu")
if note_card.get("items"):
self_info: Dict = await self.query_self()
if self_info and self_info.get("data", {}).get("result", {}).get("success"):
ping_flag = True
except Exception as e:
utils.logger.error(
f"[XiaoHongShuClient.pong] Ping xhs failed: {e}, and try to login again..."
f"[XiaoHongShuClient.pong] Check login state failed: {e}, and try to login again..."
)
ping_flag = False
utils.logger.info(f"[XiaoHongShuClient.pong] Login state result: {ping_flag}")
return ping_flag
async def update_cookies(self, browser_context: BrowserContext):
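
For context, `pong()` now derives login validity from the self-info endpoint rather than from a search request. A hedged sketch of how a crawler might consume it; apart from `pong()` and `update_cookies()`, the names (notably `login_obj.begin()`) are assumptions about the surrounding code:

```python
async def ensure_login(xhs_client, login_obj, browser_context) -> None:
    """Re-run the login flow only when the cached session is no longer valid."""
    if await xhs_client.pong():
        return  # self-info query succeeded, session still valid
    await login_obj.begin()  # assumed login entry point
    await xhs_client.update_cookies(browser_context=browser_context)
```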
@@ -443,44 +460,61 @@ class XiaoHongShuClient(AbstractApiClient, ProxyRefreshMixin):
result = []
for comment in comments:
note_id = comment.get("note_id")
sub_comments = comment.get("sub_comments")
if sub_comments and callback:
await callback(note_id, sub_comments)
try:
note_id = comment.get("note_id")
sub_comments = comment.get("sub_comments")
if sub_comments and callback:
await callback(note_id, sub_comments)
sub_comment_has_more = comment.get("sub_comment_has_more")
if not sub_comment_has_more:
continue
root_comment_id = comment.get("id")
sub_comment_cursor = comment.get("sub_comment_cursor")
while sub_comment_has_more:
comments_res = await self.get_note_sub_comments(
note_id=note_id,
root_comment_id=root_comment_id,
xsec_token=xsec_token,
num=10,
cursor=sub_comment_cursor,
)
if comments_res is None:
utils.logger.info(
f"[XiaoHongShuClient.get_comments_all_sub_comments] No response found for note_id: {note_id}"
)
sub_comment_has_more = comment.get("sub_comment_has_more")
if not sub_comment_has_more:
continue
sub_comment_has_more = comments_res.get("has_more", False)
sub_comment_cursor = comments_res.get("cursor", "")
if "comments" not in comments_res:
utils.logger.info(
f"[XiaoHongShuClient.get_comments_all_sub_comments] No 'comments' key found in response: {comments_res}"
)
break
comments = comments_res["comments"]
if callback:
await callback(note_id, comments)
await asyncio.sleep(crawl_interval)
result.extend(comments)
root_comment_id = comment.get("id")
sub_comment_cursor = comment.get("sub_comment_cursor")
while sub_comment_has_more:
try:
comments_res = await self.get_note_sub_comments(
note_id=note_id,
root_comment_id=root_comment_id,
xsec_token=xsec_token,
num=10,
cursor=sub_comment_cursor,
)
if comments_res is None:
utils.logger.info(
f"[XiaoHongShuClient.get_comments_all_sub_comments] No response found for note_id: {note_id}"
)
break
sub_comment_has_more = comments_res.get("has_more", False)
sub_comment_cursor = comments_res.get("cursor", "")
if "comments" not in comments_res:
utils.logger.info(
f"[XiaoHongShuClient.get_comments_all_sub_comments] No 'comments' key found in response: {comments_res}"
)
break
comments = comments_res["comments"]
if callback:
await callback(note_id, comments)
await asyncio.sleep(crawl_interval)
result.extend(comments)
except DataFetchError as e:
utils.logger.warning(
f"[XiaoHongShuClient.get_comments_all_sub_comments] Failed to get sub-comments for note_id: {note_id}, root_comment_id: {root_comment_id}, error: {e}. Skipping this comment's sub-comments."
)
break # Break out of the sub-comment acquisition loop of the current comment and continue processing the next comment
except Exception as e:
utils.logger.error(
f"[XiaoHongShuClient.get_comments_all_sub_comments] Unexpected error when getting sub-comments for note_id: {note_id}, root_comment_id: {root_comment_id}, error: {e}"
)
break
except Exception as e:
utils.logger.error(
f"[XiaoHongShuClient.get_comments_all_sub_comments] Error processing comment: {comment.get('id', 'unknown')}, error: {e}. Continuing with next comment."
)
continue # Continue to next comment
return result
async def get_creator_info(
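
The refactor above (PR #829) wraps each root comment, and each sub-comment page request, in its own try/except so that a single DataFetchError no longer aborts the whole crawl. Stripped of the client details, the pattern is roughly the following sketch, where `fetch_page` stands in for the real paginated API call:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def collect_sub_comments(comments, fetch_page, crawl_interval=2.0):
    """Gather sub-comments, skipping any root comment whose pages fail.

    `fetch_page(root_id, cursor)` stands in for the real paginated API call
    and is assumed to raise on request failures.
    """
    result = []
    for comment in comments:
        try:
            has_more = comment.get("sub_comment_has_more")
            cursor = comment.get("sub_comment_cursor")
            while has_more:
                page = await fetch_page(comment["id"], cursor)
                if not page or "comments" not in page:
                    break
                result.extend(page["comments"])
                has_more = page.get("has_more", False)
                cursor = page.get("cursor", "")
                await asyncio.sleep(crawl_interval)
        except Exception as exc:
            # One failing comment is logged and skipped instead of killing the run.
            logger.warning("Skipping sub-comments of %s: %s", comment.get("id"), exc)
            continue
    return result
```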


@@ -42,7 +42,7 @@ from tools.cdp_browser import CDPBrowserManager
from var import crawler_type_var, source_keyword_var
from .client import XiaoHongShuClient
from .exception import DataFetchError
from .exception import DataFetchError, NoteNotFoundError
from .field import SearchSortType
from .help import parse_note_info_from_note_url, parse_creator_info_from_url, get_search_id
from .login import XiaoHongShuLogin
@@ -308,6 +308,9 @@ class XiaoHongShuCrawler(AbstractCrawler):
return note_detail
except NoteNotFoundError as ex:
utils.logger.warning(f"[XiaoHongShuCrawler.get_note_detail_async_task] Note not found: {note_id}, {ex}")
return None
except DataFetchError as ex:
utils.logger.error(f"[XiaoHongShuCrawler.get_note_detail_async_task] Get note detail error: {ex}")
return None


@@ -27,3 +27,7 @@ class DataFetchError(RequestError):
class IPBlockError(RequestError):
"""fetch so fast that the server block us ip"""
class NoteNotFoundError(RequestError):
"""Note does not exist or is abnormal"""


@@ -28,10 +28,15 @@ import aiofiles
from base.base_crawler import AbstractStoreImage, AbstractStoreVideo
from tools import utils
import config
class BilibiliVideo(AbstractStoreVideo):
video_store_path: str = "data/bili/videos"
def __init__(self):
if config.SAVE_DATA_PATH:
self.video_store_path = f"{config.SAVE_DATA_PATH}/bili/videos"
else:
self.video_store_path = "data/bili/videos"
async def store_video(self, video_content_item: Dict):
"""


@@ -24,10 +24,15 @@ import aiofiles
from base.base_crawler import AbstractStoreImage, AbstractStoreVideo
from tools import utils
import config
class DouYinImage(AbstractStoreImage):
image_store_path: str = "data/douyin/images"
def __init__(self):
if config.SAVE_DATA_PATH:
self.image_store_path = f"{config.SAVE_DATA_PATH}/douyin/images"
else:
self.image_store_path = "data/douyin/images"
async def store_image(self, image_content_item: Dict):
"""
@@ -74,7 +79,11 @@ class DouYinImage(AbstractStoreImage):
class DouYinVideo(AbstractStoreVideo):
video_store_path: str = "data/douyin/videos"
def __init__(self):
if config.SAVE_DATA_PATH:
self.video_store_path = f"{config.SAVE_DATA_PATH}/douyin/videos"
else:
self.video_store_path = "data/douyin/videos"
async def store_video(self, video_content_item: Dict):
"""


@@ -46,6 +46,7 @@ except ImportError:
from base.base_crawler import AbstractStore
from tools import utils
import config
class ExcelStoreBase(AbstractStore):
@@ -111,7 +112,10 @@ class ExcelStoreBase(AbstractStore):
self.crawler_type = crawler_type
# Create data directory
self.data_dir = Path("data") / platform
if config.SAVE_DATA_PATH:
self.data_dir = Path(config.SAVE_DATA_PATH) / platform
else:
self.data_dir = Path("data") / platform
self.data_dir.mkdir(parents=True, exist_ok=True)
# Initialize workbook


@@ -83,7 +83,7 @@ async def update_weibo_note(note_item: Dict):
content_text = mblog.get("text")
clean_text = re.sub(r"<.*?>", "", content_text)
save_content_item = {
# 微博信息
# Weibo information
"note_id": note_id,
"content": clean_text,
"create_time": utils.rfc2822_to_timestamp(mblog.get("created_at")),
@@ -95,7 +95,7 @@ async def update_weibo_note(note_item: Dict):
"note_url": f"https://m.weibo.cn/detail/{note_id}",
"ip_location": mblog.get("region_name", "").replace("发布于 ", ""),
# 用户信息
# User information
"user_id": str(user_info.get("id")),
"nickname": user_info.get("screen_name", ""),
"gender": user_info.get("gender", ""),
@@ -151,7 +151,7 @@ async def update_weibo_note_comment(note_id: str, comment_item: Dict):
"ip_location": comment_item.get("source", "").replace("来自", ""),
"parent_comment_id": comment_item.get("rootid", ""),
# 用户信息
# User information
"user_id": str(user_info.get("id")),
"nickname": user_info.get("screen_name", ""),
"gender": user_info.get("gender", ""),


@@ -28,10 +28,15 @@ import aiofiles
from base.base_crawler import AbstractStoreImage, AbstractStoreVideo
from tools import utils
import config
class WeiboStoreImage(AbstractStoreImage):
image_store_path: str = "data/weibo/images"
def __init__(self):
if config.SAVE_DATA_PATH:
self.image_store_path = f"{config.SAVE_DATA_PATH}/weibo/images"
else:
self.image_store_path = "data/weibo/images"
async def store_image(self, image_content_item: Dict):
"""


@@ -28,10 +28,15 @@ import aiofiles
from base.base_crawler import AbstractStoreImage, AbstractStoreVideo
from tools import utils
import config
class XiaoHongShuImage(AbstractStoreImage):
image_store_path: str = "data/xhs/images"
def __init__(self):
if config.SAVE_DATA_PATH:
self.image_store_path = f"{config.SAVE_DATA_PATH}/xhs/images"
else:
self.image_store_path = "data/xhs/images"
async def store_image(self, image_content_item: Dict):
"""
@@ -78,7 +83,11 @@ class XiaoHongShuImage(AbstractStoreImage):
class XiaoHongShuVideo(AbstractStoreVideo):
video_store_path: str = "data/xhs/videos"
def __init__(self):
if config.SAVE_DATA_PATH:
self.video_store_path = f"{config.SAVE_DATA_PATH}/xhs/videos"
else:
self.video_store_path = "data/xhs/videos"
async def store_video(self, video_content_item: Dict):
"""


@@ -113,6 +113,8 @@ class ZhihuDbStoreImplement(AbstractStore):
if hasattr(existing_content, key):
setattr(existing_content, key, value)
else:
if "add_ts" not in content_item:
content_item["add_ts"] = utils.get_current_timestamp()
new_content = ZhihuContent(**content_item)
session.add(new_content)
await session.commit()
@@ -133,6 +135,8 @@ class ZhihuDbStoreImplement(AbstractStore):
if hasattr(existing_comment, key):
setattr(existing_comment, key, value)
else:
if "add_ts" not in comment_item:
comment_item["add_ts"] = utils.get_current_timestamp()
new_comment = ZhihuComment(**comment_item)
session.add(new_comment)
await session.commit()
@@ -153,6 +157,8 @@ class ZhihuDbStoreImplement(AbstractStore):
if hasattr(existing_creator, key):
setattr(existing_creator, key, value)
else:
if "add_ts" not in creator:
creator["add_ts"] = utils.get_current_timestamp()
new_creator = ZhihuCreator(**creator)
session.add(new_creator)
await session.commit()


@@ -20,7 +20,7 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Name: Programmer Ajiang-Relakkes
# @Time : 2024/6/2 10:35
# @Desc :


@@ -20,7 +20,7 @@
# -*- coding: utf-8 -*-
# @Author : relakkes@gmail.com
# @Name : 程序员阿江-Relakkes
# @Name: Programmer Ajiang-Relakkes
# @Time : 2024/6/2 19:54
# @Desc :


@@ -35,7 +35,10 @@ class AsyncFileWriter:
self.wordcloud_generator = AsyncWordCloudGenerator() if config.ENABLE_GET_WORDCLOUD else None
def _get_file_path(self, file_type: str, item_type: str) -> str:
base_path = f"data/{self.platform}/{file_type}"
if config.SAVE_DATA_PATH:
base_path = f"{config.SAVE_DATA_PATH}/{self.platform}/{file_type}"
else:
base_path = f"data/{self.platform}/{file_type}"
pathlib.Path(base_path).mkdir(parents=True, exist_ok=True)
file_name = f"{self.crawler_type}_{item_type}_{utils.get_current_date()}.{file_type}"
return f"{base_path}/{file_name}"
@@ -113,7 +116,10 @@ class AsyncFileWriter:
return
# Generate wordcloud
words_base_path = f"data/{self.platform}/words"
if config.SAVE_DATA_PATH:
words_base_path = f"{config.SAVE_DATA_PATH}/{self.platform}/words"
else:
words_base_path = f"data/{self.platform}/words"
pathlib.Path(words_base_path).mkdir(parents=True, exist_ok=True)
words_file_prefix = f"{words_base_path}/{self.crawler_type}_comments_{utils.get_current_date()}"


@@ -180,11 +180,18 @@ def format_proxy_info(ip_proxy_info) -> Tuple[Optional[Dict], Optional[str]]:
from proxy.proxy_ip_pool import IpInfoModel
ip_proxy_info = cast(IpInfoModel, ip_proxy_info)
# Playwright proxy server should be in format "host:port" without protocol prefix
server = f"{ip_proxy_info.ip}:{ip_proxy_info.port}"
playwright_proxy = {
"server": f"{ip_proxy_info.protocol}{ip_proxy_info.ip}:{ip_proxy_info.port}",
"username": ip_proxy_info.user,
"password": ip_proxy_info.password,
"server": server,
}
# Only add username and password if they are not empty
if ip_proxy_info.user and ip_proxy_info.password:
playwright_proxy["username"] = ip_proxy_info.user
playwright_proxy["password"] = ip_proxy_info.password
# httpx 0.28.1 requires passing proxy URL string directly, not a dictionary
if ip_proxy_info.user and ip_proxy_info.password:
httpx_proxy = f"http://{ip_proxy_info.user}:{ip_proxy_info.password}@{ip_proxy_info.ip}:{ip_proxy_info.port}"
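
The end of this hunk is truncated, but the intent is visible: Playwright now receives a bare `host:port` server with credentials attached only when present, while httpx (0.28+) receives a plain proxy URL string. A hedged sketch of the resulting shapes with a made-up proxy record; the credential-free httpx branch is an assumption, since that part is cut off above:

```python
# Hypothetical proxy record; field names mirror how IpInfoModel is used above.
ip, port, user, password = "203.0.113.10", 8888, "demo_user", "demo_pass"

playwright_proxy = {"server": f"{ip}:{port}"}  # no protocol prefix
if user and password:
    playwright_proxy["username"] = user
    playwright_proxy["password"] = password

# httpx 0.28.x takes a proxy URL string rather than a dict.
if user and password:
    httpx_proxy = f"http://{user}:{password}@{ip}:{port}"
else:
    httpx_proxy = f"http://{ip}:{port}"  # assumed fallback when the provider has no auth

print(playwright_proxy, httpx_proxy)
```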